Lecture 1: Basic Set Theory
Lecturer: Krishna Jagannathan Scribe: Arjun Bhagoji 


We will begin with an informal and intuitive approach to set theory known as “Naive Set Theory”. 


1.1 What is a set? 


A set can be thought of as a collection of well-defined objects. By well-defined, we mean that an object 
either belongs to a set or it does not. Objects belonging to a set are known as elements of the set. Sets can 
be specified in 2 ways: 


1. Extensional definition: All the elements of the set are listed explicitly and enclosed within curly
brackets. E.g., the set of all natural numbers from 1 to 5 may be specified as $A = \{1, 2, 3, 4, 5\}$.


2. Intensional definition: Here, a set is defined in terms of a property which is satisfied by all its members.
This is also known as set-builder notation. E.g., the set $A$ above may also be defined as
$A = \{x \mid x \le 5,\ x \in \mathbb{N}\}$. In general, a set $C$ may be defined as $C = \{x \mid P(x)\}$, where $P(x)$ is some property.


We now define the notion of a subset and use the idea of subsets to define when two sets are identical. 


Definition 1.1 (i) A set $A$ is said to be a subset of (or contained in) another set $B$ if every element of
$A$ is also an element of $B$. This is denoted as $A \subseteq B$. Here, $B$ is said to be a superset of $A$.


(ii) $A$ is a proper subset of $B$ (denoted $A \subset B$) if $A$ is a subset of $B$ and there is at least one element in
$B$ which does not belong to $A$.


(iii) Two sets $A$ and $B$ are said to be identical (or equal) if $A \subseteq B$ and $B \subseteq A$. In other words, every
element of $A$ is an element of $B$, and vice versa.
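
The three parts of Definition 1.1 can be tried out on small finite sets. The following is a minimal sketch using Python's built-in set type; the sets A, B and C and the helper names are illustrative assumptions, not part of the lecture.

```python
# Illustrative finite sets (hypothetical examples, not from the lecture).
A = {1, 2, 3}
B = {1, 2, 3, 4, 5}
C = {3, 2, 1}

def is_subset(X, Y):
    """Every element of X is also an element of Y."""
    return all(x in Y for x in X)

def is_proper_subset(X, Y):
    """X is a subset of Y, and Y has at least one element not in X."""
    return is_subset(X, Y) and any(y not in X for y in Y)

def are_equal(X, Y):
    """X and Y are identical when each is a subset of the other."""
    return is_subset(X, Y) and is_subset(Y, X)

print(is_subset(A, B))         # A is contained in B
print(is_proper_subset(A, B))  # 4 and 5 belong to B but not to A
print(are_equal(A, C))         # the order of listing does not matter
```

Python's own operators `<=`, `<` and `==` on sets implement the same three relations; the explicit helpers above only spell out the definition.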


Two special sets of interest are: 


1. The universal set $U$, a set which contains all elements. (However, in the usual formulations of set theory,
the concept of a universal set leads to a paradox known as Russell's paradox. The interested student may
look up this famous paradox, ‘en.wikipedia.org/wiki/Russell’ s_parado’.)


2. The empty set $\emptyset$, which has, as its name indicates, no elements. It is a subset of every set including
itself and a proper subset of every set excluding itself.


1.2 Operations on sets


1.2.1 Complement


Taking the complement of a set is a unary operation (i.e., only one set is operated upon), defined as follows.


Definition 1.2 For a set $A$, its complement is defined as $A^c \triangleq \{x \mid x \notin A,\ x \in U\}$.


The context for the complement of a set is provided by the universal set $U$. The Venn diagram representation
of a set's complement is shown in Figure 1.1.


[Figure 1.1: Complement (gray area) of a set A]


1.2.2 Union and Intersection 
Let $\mathcal{I}$ be an abstract index set. Consider a family of sets $\{A_i,\ i \in \mathcal{I}\}$ indexed by $\mathcal{I}$.


Definition 1.3 Union: The union of $\{A_i,\ i \in \mathcal{I}\}$ is defined as

$$\bigcup_{i \in \mathcal{I}} A_i = \{x \mid x \in A_j \text{ for some } j \in \mathcal{I}\}.$$


In words, the union $\bigcup_{i \in \mathcal{I}} A_i$ is a set consisting of those elements which are elements of at least one of the
$A_i$'s.


Definition 1.4 Intersection: The intersection of $\{A_i,\ i \in \mathcal{I}\}$ is defined as

$$\bigcap_{i \in \mathcal{I}} A_i = \{x \mid x \in A_j \text{ for every } j \in \mathcal{I}\}.$$


In words, the intersection $\bigcap_{i \in \mathcal{I}} A_i$ is a set consisting of those elements which are elements of all the $A_i$'s.


Remark 1.5: When the index set $\mathcal{I}$ is a finite set, say $\mathcal{I} = \{1,2,3\}$, the definition of union given above
coincides with the “middle-school” understanding of unions, i.e., taking the union of sets one at a time. For
example, $\bigcup_{i \in \mathcal{I}} A_i = A_1 \cup A_2 \cup A_3$. However, this “one-by-one” interpretation completely breaks down when
the index set $\mathcal{I}$ is infinite. For example, when $\mathcal{I} = \mathbb{N}$, the union $\bigcup_{i=1}^{\infty} A_i$ does not have any interpretation
in terms of taking unions one by one, till infinity. After all, there is no $A_\infty$ in the family $\{A_i,\ i \in \mathbb{N}\}$, and
there is no notion of “limiting unions”. Thus, $\bigcup_{i=1}^{\infty} A_i$ should be interpreted just as Definition 1.3 says: it
is the set of all elements contained in at least one of the $A_i$, $i \in \mathbb{N}$.


In order to avoid the (dangerous) temptation to interpret $\bigcup_{i=1}^{\infty} A_i$ as some sort of a limit of finite,
“one-by-one” unions, a better notation would be $\bigcup_{i \in \mathbb{N}} A_i$, instead of the potentially misleading but more
commonly used notation $\bigcup_{i=1}^{\infty} A_i$.

The following useful identities related to unions and intersections can be proven easily (do it!) from the
definitions:

$$\Big(\bigcap_{i \in \mathcal{I}} A_i\Big) \cup B = \bigcap_{i \in \mathcal{I}} (A_i \cup B), \qquad (1.1)$$

and

$$\Big(\bigcup_{i \in \mathcal{I}} A_i\Big) \cap B = \bigcup_{i \in \mathcal{I}} (A_i \cap B). \qquad (1.2)$$

An especially important set of laws regarding the interchangeability of unions and intersections under the 
complement operation are De Morgan’s laws. The two laws are (prove them!): 


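1. $\big(\bigcap_{i \in \mathcal{I}} A_i\big)^c = \bigcup_{i \in \mathcal{I}} A_i^c$, that is, the complement of the intersection is the union of the complements.


2. $\big(\bigcup_{i \in \mathcal{I}} A_i\big)^c = \bigcap_{i \in \mathcal{I}} A_i^c$, that is, the complement of the union is the intersection of the complements.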
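
De Morgan's laws can be checked on a small finite family before attempting the proof. The sketch below is only a sanity check under assumed data (the universal set `U` and the three sets in `family` are hypothetical); a finite check is of course not a proof.

```python
from functools import reduce

U = set(range(10))                       # assumed universal set for this example
family = [{1, 2, 3}, {2, 3, 4}, {3, 5}]  # the sets A_i, indexed by i = 1, 2, 3

def complement(S):
    """Complement of S relative to the universal set U."""
    return U - S

def union_of(sets):
    """Elements belonging to at least one set in the family."""
    return set().union(*sets)

def intersection_of(sets):
    """Elements belonging to every set in the (non-empty) family."""
    return reduce(lambda X, Y: X & Y, sets)

complements = [complement(S) for S in family]

# Law 1: complement of the intersection equals the union of the complements.
law1 = complement(intersection_of(family)) == union_of(complements)
# Law 2: complement of the union equals the intersection of the complements.
law2 = complement(union_of(family)) == intersection_of(complements)

print(law1, law2)
```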
Finally, the relative complement operation on two sets allows us to “subtract” one from the other. 


Definition 1.6 Relative complement: The relative complement of $B$ in $A$ is defined as
$A \setminus B \triangleq \{x \mid x \in A,\ x \notin B\} = A \cap B^c$. Similarly, the relative complement of $A$ in $B$ is defined as
$B \setminus A \triangleq \{x \mid x \in B,\ x \notin A\} = B \cap A^c$.


[Figure 1.2: Relative complement of B in A]


The unary complement operation for a set A can also be understood as the relative complement of A in U, 
the universal set. 


1.2.3 Cartesian products 
A Cartesian product is an operation on sets which returns a product set from multiple sets. 


Definition 1.7 Cartesian product: The Cartesian product of two sets $A$ and $B$ is defined as
$A \times B \triangleq \{(x,y) : x \in A,\ y \in B\}$, that is, it is the set of all ordered pairs of elements from the two sets, such that the first
component belongs to $A$ and the second to $B$.


For example, if $A = \{1,2\}$ and $B = \{a\}$, then $A \times B = \{(1,a), (2,a)\}$ and $B \times A = \{(a,1), (a,2)\}$. Clearly,
this operation is not commutative. The Cartesian product of $n$ sets $A_1, A_2, \ldots, A_n$ is

$$A_1 \times A_2 \times \cdots \times A_n = \{(a_1, a_2, \ldots, a_n) : a_i \in A_i\}.$$

If all the $n$ sets are identical, then we get

$$A^n = \{(a_1, a_2, \ldots, a_n) : a_i \in A\}.$$
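
The example above can be reproduced with `itertools.product` from the Python standard library. This is a small illustrative sketch; the element "a" is represented by the string `"a"`, which is a choice made here for the example.

```python
from itertools import product

A = {1, 2}
B = {"a"}

AxB = set(product(A, B))         # all ordered pairs (x, y) with x in A, y in B
BxA = set(product(B, A))         # all ordered pairs (y, x) with y in B, x in A
A3 = set(product(A, repeat=3))   # A^3: all ordered triples with components in A

print(AxB != BxA)   # the operation is not commutative
print(len(A3))      # |A|^3 ordered triples
```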
1.3 Power sets


Definition 1.8 Power set: The power set of a set $A$, denoted as $\mathcal{P}(A)$ or $2^A$, is the set of all subsets of $A$,
including the null set $\emptyset$ and $A$ itself.


For example, the power set of $A = \{1,2\}$ is

$$\mathcal{P}(A) = \{\emptyset, \{1\}, \{2\}, \{1,2\}\}.$$


A power set is an example of a class, which is a collection of sets and is usually denoted by a script letter,
like so: $\mathcal{A}$. The union and intersection operations extend to classes, as does the idea of subsets, in a suitably
modified form.
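
For a finite set, the power set of Definition 1.8 can be enumerated directly. The sketch below is one possible way to do it; the function name `power_set` is a hypothetical helper, and subsets are stored as `frozenset` objects because Python sets cannot contain ordinary (mutable) sets.

```python
from itertools import chain, combinations

def power_set(A):
    """Return the set of all subsets of A, each represented as a frozenset."""
    elems = list(A)
    # Subsets of size 0, 1, ..., len(A), chained into one stream.
    subsets = chain.from_iterable(
        combinations(elems, r) for r in range(len(elems) + 1)
    )
    return {frozenset(s) for s in subsets}

P = power_set({1, 2})
print(len(P))  # a set with n elements has 2^n subsets
```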


1.4 Functions 


Definition 1.9 A function $f$ from a set $A$ to another set $B$ is a subset of the Cartesian product $A \times B$ of
the sets such that every element of $A$ is the first component of one and only one ordered pair in the subset.
In simple terms, it is a rule that maps every element of set $A$ to a unique element in set $B$. It is commonly
denoted as $f : A \to B$; $A$ is known as the domain and $B$ as the codomain.


The element in the codomain (say, $b$) which is associated with an element in the domain (say, $a$) is known
as the image of $a$; the element $a$ is called the argument of the function $f$ and is also termed
a pre-image of $b$. We then say that $f$ maps $a$ to $b$, written $b = f(a)$. The range of a
function is the set of all elements in the codomain which are images of elements in the domain; hence, it
is a subset (not necessarily a proper subset) of the codomain. Functions can be classified as follows:


1. Injective: An injective or one-to-one function is one where $a \ne b \Rightarrow f(a) \ne f(b)$, $\forall\, a, b \in \mathrm{domain}(f)$.
E.g., the function $f : \mathbb{N} \to \mathbb{R}$ defined as $f(x) = x$, $\forall\, x \in \mathbb{N}$, is an injective function.


2. Surjective: A surjective or onto function is one where $\forall\, b \in \mathrm{codomain}(f)$, $\exists\, a \in \mathrm{domain}(f)$ such that $f(a) = b$.
For example, the following are surjective functions:
(i) Let $A = \{1,2,3\}$ and $B = \{0,1\}$. The function $g : A \to B$ defined as $g(1) = 0$, $g(2) = 0$ and
$g(3) = 1$ is a surjective function.
(ii) The function $h : \mathbb{R} \to \mathbb{R}$ defined as $h(x) = x$, $\forall\, x \in \mathbb{R}$, is also surjective.


A function which is both injective and surjective is known as a bijective function (or a bijection). The
function $h$ stated above is also a bijective function. An inverse function can be defined for a
bijection, since the mapping is unique and the entire codomain is covered. The notion of a bijection can be
used to understand the equicardinality of infinite sets, i.e., when can we say that the “size” of two infinite
sets is equal? This question will be answered in the subsequent lectures.
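
For functions on finite sets, the three classes above can be tested mechanically. In the sketch below a function is represented as a Python dictionary mapping each argument to its image; this representation and the helper names are assumptions made for illustration only.

```python
def is_injective(f):
    """Distinct arguments have distinct images."""
    images = list(f.values())
    return len(images) == len(set(images))

def is_surjective(f, codomain):
    """Every element of the codomain is the image of some argument."""
    return set(f.values()) == set(codomain)

def is_bijective(f, codomain):
    """Both injective and surjective."""
    return is_injective(f) and is_surjective(f, codomain)

# The function g : {1, 2, 3} -> {0, 1} with g(1) = 0, g(2) = 0, g(3) = 1.
g = {1: 0, 2: 0, 3: 1}
print(is_injective(g), is_surjective(g, {0, 1}))

# A hypothetical bijection on a three-element set, together with its inverse.
h = {1: "a", 2: "b", 3: "c"}
h_inverse = {image: arg for arg, image in h.items()}
print(is_bijective(h, {"a", "b", "c"}), h_inverse["b"])
```

Swapping keys and values yields a valid inverse only because `h` is a bijection; for `g`, the same construction would silently lose one of the two pre-images of 0.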
Lecture 2: A crash course in Real Analysis


Lecturer: Dr. Krishna Jagannathan Scribe: Sudharsan Parthasarathy 


This lecture is an introduction to Real Analysis. Here we introduce the important concepts and theorems
from Real Analysis that will be useful in the rest of the course. Interested readers may refer to the book listed
in the References section for the proofs of the theorems.


2.1 Notations 


$\in$ - belongs to.

$\exists$ - there exists.

$\forall$ - for all.

$\Rightarrow$ - implies.

$\mathbb{R}$ - set of real numbers.

$\mathbb{Q}$ - set of rational numbers.

$\wedge$ - and.

$\mathbb{N}$ - set of natural numbers.

$\to$ - converges to.

iff - if and only if.

$\subseteq$ - is a subset of.

$\emptyset$ - null set.

$\cap$ - intersection.

i.e. - that is.


2.2 Field 


A set $X$ is a field if it satisfies the six properties listed below under the two abstract operations ‘+’ and ‘.’.

Closure: If $a, b \in X$, then $a+b \in X$ and $a.b \in X$. Hence $X$ is closed under addition and multiplication.


Commutativity: If $a, b \in X$, then $a+b = b+a$ and $a.b = b.a$. Hence $X$ is commutative under addition
and multiplication.


Associativity: If $a, b, c \in X$, then $(a+b)+c = a+(b+c)$ and $(a.b).c = a.(b.c)$. Hence $X$ is associative
under addition and multiplication.


Identity: $\exists$ elements 0 and 1 in $X$ such that $a+0 = a$ and $a.1 = a$ for every $a \in X$.


Inverse: If $a \in X$, then $\exists$ elements $-a$ and $a^{-1}$ in $X$ such that $a+(-a) = 0$ and $a.a^{-1} = 1$. The multiplicative
inverse does not exist if $a = 0$.


Distributivity: If $a, b, c \in X$, then multiplication is distributive with respect to addition, i.e.,
$a.(b+c) = a.b + a.c$.


Note that the elements 0 and 1 are unique. If $X \subseteq \mathbb{R}$, then the elements in the field can also
be compared. A field whose elements can be compared is called an ordered field. Another example of an
ordered field is the set of rational numbers $\mathbb{Q}$. Henceforth we will concentrate only on the real field $\mathbb{R}$.


2.2.1 Order axioms 


Law of trichotomy: If $a, b \in \mathbb{R}$, then $a = b$ or $a > b$ or $a < b$.

Transitivity: Let $a, b, c \in \mathbb{R}$. If $a > b$ and $b > c$, then $a > c$.

Ordering and addition operator: Let $a, b, c \in \mathbb{R}$. $a > b \Rightarrow a+c > b+c$.

Ordering and product operator: Let $a, b, c \in \mathbb{R}$. $a > b \Rightarrow ac > bc$, if $c > 0$.


2.3 Boundedness 


A subset $S$ of $\mathbb{R}$ is bounded above if $\exists$ a real number $M$ such that $x \le M$, $\forall\, x \in S$. Here, $M$ is called an
upper bound of $S$. Similarly, $S$ is bounded below if $\exists$ a real number $m$ such that $x \ge m$, $\forall\, x \in S$. Here, $m$
is called a lower bound of $S$. A set is bounded if it is bounded both above and below. Any number greater
than $M$ is also an upper bound of $S$, and any number less than $m$ is also a lower bound of $S$.


Supremum: The supremum of $S$ is the least upper bound of the set $S$. More precisely, $K$ is a supremum of
$S$ if


• $K$ is an upper bound of $S$, i.e., $x \le K$, $\forall\, x \in S$.


• There exists no number less than $K$ which is an upper bound of $S$, i.e., for any $\delta > 0$, $\exists\, x \in S$ such
that $x > K - \delta$.


Similarly one can define the infimum, as the greatest lower bound of a set. It is important to note that 
supremum and infimum need not be elements of the set. For instance, 1 is the supremum of the set (0,1), 
but is not an element of the set. Also, if the supremum is an element of the set itself, then it is the maximum 
of that set. 


2.4 Completeness property 


The completeness axiom or the least upper bound property is one of the fundamental properties of the real 
field R. 


Completeness Axiom: Any non-empty subset $A$ of $\mathbb{R}$ which is bounded above has a supremum in $\mathbb{R}$.


In other words, the Completeness Axiom guarantees that, for any non-empty subset of $\mathbb{R}$ that is bounded above,
a supremum exists. Although $\mathbb{R}$ and $\mathbb{Q}$ are both ordered fields, we will see in the exercises below that the latter
does not satisfy the completeness property. Indeed, completeness along with the ordered field property
characterizes $\mathbb{R}$. Thus, $\mathbb{R}$ is also referred to as a complete ordered field.


We now list a few important theorems (without proofs), which are consequences of the completeness property. 


Theorem 2.1 If $x$ and $y$ are any two positive real numbers, then there exists a positive integer $m$ such that
$mx > y$. This is called the Archimedean property of real numbers.

Theorem 2.2 Every open interval contains a rational number.


Theorem 2.3 Let $x \in \mathbb{R}$, $n \ge 2$, $n \in \mathbb{N}$, then


• If $x > 0$ and $n$ is even, then $\exists$ a unique $y > 0$ such that $y^n = x$.


• If $x \in \mathbb{R}$ and $n$ is odd, then $\exists$ a unique $y \in \mathbb{R}$ such that $y^n = x$.


2.5 Sequences 


A (real) sequence is a function from $\mathbb{N}$ to $\mathbb{R}$. A sequence $\{x_n\}$ of real numbers is said to converge to $x \in \mathbb{R}$
if for every $\epsilon > 0$, $\exists$ a natural number $n_0$ such that $|x_n - x| < \epsilon$ $\forall\, n \ge n_0$.
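
To see how the definition of convergence is used, consider the illustrative sequence $x_n = 1/n$ with limit 0 (this sequence is an example chosen here, not one from the lecture). For a given $\epsilon > 0$, any $n_0 > 1/\epsilon$ works, since $n \ge n_0$ implies $1/n \le 1/n_0 < \epsilon$. The sketch below computes such an $n_0$ and spot-checks the inequality on a finite window of indices, which is only a numerical illustration and not a proof.

```python
import math

def x(n):
    """The illustrative sequence x_n = 1/n."""
    return 1 / n

def n0_for(eps):
    """Return an index n0 with n0 > 1/eps (assumes eps > 0)."""
    return math.floor(1 / eps) + 1

for eps in (0.5, 0.1, 0.003):
    n0 = n0_for(eps)
    # Spot-check |x_n - 0| < eps on a finite window of indices n >= n0.
    ok = all(abs(x(n) - 0) < eps for n in range(n0, n0 + 1000))
    print(eps, n0, ok)
```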


Theorem 2.4 Let $\{x_n\}$ be a monotonically increasing sequence such that $x_n \le a$ for some $a \in \mathbb{R}$, and all
$n \ge 1$. Then $\{x_n\}$ converges to a real number.


In other words, the above theorem can be stated as: a monotonically non-decreasing sequence which is 
bounded above converges. The proof again uses the completeness property. The student is encouraged to 
attempt a proof of this theorem, before referring to a text. 


Corollary 2.5 A convergent sequence is bounded. 


Of course, a bounded sequence need not converge: consider, for example, the sequence $x_n = (-1)^n$. The
sequence is bounded, but does not converge. Next, we list some elementary properties of limits. Let $\{x_n\}$
and $\{y_n\}$ be two sequences that converge to $x$ and $y$, respectively.

• $x_n + y_n \to x + y$.

• $a x_n \to a x$, $\forall\, a \in \mathbb{R}$.

• $x_n y_n \to x y$.

• $x_n \ge 0\ \forall\, n \Rightarrow x \ge 0$.

• $x_n \le y_n\ \forall\, n \Rightarrow x \le y$.

• If $y \ne 0$, $\frac{x_n}{y_n} \to \frac{x}{y}$.


Theorem 2.6 Sandwich/Two policemen theorem:
If $x_n \le z_n \le y_n$ $\forall\, n$ and if $x_n$ and $y_n$ converge to $x$, then $z_n$ also converges to $x$ (Prove it!).


Examples of some important sequences are as follows:


Cauchy Sequences: A sequence $\{x_n\}$ is called a Cauchy sequence if $\forall\, \epsilon > 0$, $\exists\, n_0 \in \mathbb{N}$ such that
$|x_n - x_m| < \epsilon$ $\forall\, n, m \ge n_0$.


Subsequences: A subsequence of a sequence is an infinite ordered subset of that sequence. Here are a few 
basic theorems about subsequences. 
Theorem 2.7 Every real sequence has a monotonic subsequence.

Theorem 2.8 Bolzano-Weierstrass Theorem: Every bounded sequence has a convergent subsequence.


Theorem 2.9 A sequence $\{x_n\}$ is convergent iff $\{x_n\}$ is bounded and every convergent subsequence of $\{x_n\}$
converges to the same limit.


Theorem 2.10 A real sequence is convergent iff it is a Cauchy sequence. 


2.6 Metric Spaces 


A set X is a metric space if we can associate a real number d(a,b) with any two elements a and b of the set 
X such that 


• $d(a,b) > 0$ if $a \ne b$; $d(a,a) = 0$.

• $d(a,b) = d(b,a)$.

• Triangle inequality: $d(a,b) \le d(a,c) + d(b,c)$ for any $c \in X$.


Any function d that satisfies these properties on a set is called a metric. 
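
The three metric properties can be spot-checked on a finite sample of points. The sketch below does this for the usual distance $|a - b|$ on the real line and for the squared difference $(a-b)^2$; both candidate functions and the sample points are illustrative assumptions. Passing the check on a sample does not prove that a function is a metric, but failing it does show that it is not one.

```python
from itertools import product

def d(a, b):
    """Usual distance on the real line (an illustrative choice of metric)."""
    return abs(a - b)

def looks_like_metric(dist, points):
    """Check the three metric properties on every triple from a finite sample."""
    for a, b, c in product(points, repeat=3):
        if a != b and not dist(a, b) > 0:      # positivity for distinct points
            return False
        if dist(a, a) != 0:                    # zero distance to itself
            return False
        if dist(a, b) != dist(b, a):           # symmetry
            return False
        if dist(a, b) > dist(a, c) + dist(b, c):  # triangle inequality
            return False
    return True

sample = [-2, 0, 1, 3, 7]
print(looks_like_metric(d, sample))
# The squared difference violates the triangle inequality, e.g. for a=-2, b=1, c=0.
print(looks_like_metric(lambda a, b: (a - b) ** 2, sample))
```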


2.6.1 Open set 


Let $(X,d)$ be a metric space. The open ball $B(x,r)$ centred at $x$ of radius $r$ is defined as
$B(x,r) = \{y \in X : d(x,y) < r\}$. A set $A \subseteq X$ is said to be open in $X$ if for every $x \in A$, $\exists\, r > 0$ such that $B(x,r) \subseteq A$.


Theorem 2.11 Let $(X,d)$ be a metric space, then


• $X$ and the null set $\emptyset$ are open in $X$.

• An arbitrary union of open sets is open.

• A finite intersection of open sets is open.


Definition 2.12 Interior point: Let $(X,d)$ be a metric space and $A \subseteq X$. A point $x \in X$ is called an
interior point of $A$ if there exists $r > 0$ such that $B(x,r) \subseteq A$.


Let $A^\circ$ denote the set of all interior points of $A$. Clearly, $A^\circ \subseteq A$.

Lemma 2.13 Let $(X,d)$ be a metric space, then


• $A^\circ$ is open in $X$.

• $A^\circ$ is the largest open set contained in $A$.

• $A^\circ = A$ iff $A$ is open.
2.6.2 Closed set

Let $(X,d)$ be a metric space and $A \subseteq X$. $A$ is said to be closed in $X$ if $A^c$ is open in $X$.

Theorem 2.14 Let $(X,d)$ be a metric space, then


• $X$ and the null set $\emptyset$ are closed in $X$.

• An arbitrary intersection of closed sets is closed.

• A finite union of closed sets is closed.


Definition 2.15 Limit point: Let $(X,d)$ be a metric space and $A \subseteq X$. A point $x \in X$ is called a limit
point of $A$ if for every $r > 0$, $B(x,r)$ contains at least one point of $A$.


The closure of $A$, denoted $\bar{A}$, is defined as the set of all limit points of $A$. Clearly, $A \subseteq \bar{A}$.


Lemma 2.16 Let $(X,d)$ be a metric space, then


• $\bar{A}$ is closed in $X$.

• $\bar{A}$ is the smallest closed set containing $A$.

• $A = \bar{A}$ iff $A$ is closed.


2.6.3 Compact set 


A subset A of a metric space X is compact if every sequence in A has a convergent subsequence in A. 


Theorem 2.17 Heine-Borel Theorem: 
In any Euclidean space $\mathbb{R}^n$, a set $A$ is compact iff it is closed and bounded.


2.7 Functions 


A function $f$ maps every element in set $A$ to a unique element in set $B$. Let $(X, d_X)$ and $(Y, d_Y)$ be two
metric spaces. Let $A$ be a subset of $X$ and $a \in A$, and let $f$ be a function from $A$ to $Y$. The function $f$ is
said to be continuous at $a$ if for every $\epsilon > 0$ there exists a $\delta > 0$ such that $d_Y(f(x), f(a)) < \epsilon$ for all points
$x \in A$ for which $d_X(x,a) < \delta$. If $f$ is continuous at every point of $X$, then $f$ is continuous on $X$.


Theorem 2.18 $f$ is continuous iff $f(x_n)$ converges to $f(x)$ in $Y$ whenever the sequence $x_n$ converges to $x$
in $X$.


Theorem 2.19 A function $f$ that maps a metric space $X$ into a metric space $Y$ is continuous on $X$ iff
$f^{-1}(B)$ is open in $X$ for every open set $B$ in $Y$. ($f^{-1}(B)$ is the inverse image of the set $B$; $f^{-1}$ does not mean
the inverse function here.)

Theorem 2.20 A function $f$ that maps a metric space $X$ into a metric space $Y$ is continuous on $X$ iff
$f^{-1}(B)$ is closed in $X$ for every closed set $B$ in $Y$.


Theorem 2.21 If function f is a continuous mapping of a compact metric space X into a metric space Y, 
then f(X) is compact. 


2.8 Exercises 


1. Prove the uniqueness of the supremum and infimum of a set. 


2. Let $S = \{x : x \in \mathbb{Q},\ x > 0 \wedge x^2 < 2\}$ be a subset of $\mathbb{Q}$. Show that $S$ has no rational supremum.
This shows that the completeness axiom does not hold for $\mathbb{Q}$.


3. Let $f$ and $g$ be continuous real-valued functions on a metric space $X$. Show that $f+g$ and $fg$ are continuous on $X$.


References 


[WR] WALTER RUDIN, “Principles of Mathematical Analysis,” McGraw Hill International Series, 
Third Edition. 
Lecture 3: Cardinality and Countability


Lecturer: Dr. Krishna Jagannathan Scribe: Ravi Kiran Raman 


3.1 Functions 


We recall the following definitions. 


Definition 3.1 A function $f : A \to B$ is a rule that maps every element of set $A$ to a unique element in
set $B$.


In other words, $\forall\, x \in A$, $\exists\, y \in B$, and only one such element, such that $f(x) = y$. Then $y$ is called the image
of $x$, and $x$ the pre-image of $y$ under $f$. The set $A$ is called the domain of the function and $B$ the co-domain.
$R = \{y : \exists\, x \in A \text{ s.t. } f(x) = y\}$ is called the range of the function $f$.


Definition 3.2 A function $f : A \to B$ is said to be an injective (one-to-one) function if every element
in the range $R$ has a unique pre-image in $A$.


Definition 3.3 A function $f : A \to B$ is said to be a surjective (onto) function if $R = B$, i.e.,
$\forall\, y \in B$, $\exists\, x \in A$ s.t. $f(x) = y$.


Definition 3.4 A function $f : A \to B$ is a bijective function if it is both injective and surjective.


Hence, in a bijective mapping, every element in the co-domain has a pre-image and the pre-images are
unique. Thus, we can define an inverse function $f^{-1} : B \to A$ such that $f^{-1}(y) = x$ if $f(x) = y$. In simple
terms, bijective functions have well-defined inverse functions.


3.2 Cardinality and Countability 


In informal terms, the cardinality of a set is the number of elements in that set. If one wishes to compare the
cardinalities of two finite sets $A$ and $B$, it can be done by simply counting the number of elements in each
set, and declaring either that they have equal cardinality, or that one of the sets has more elements than the
other. However, when sets containing infinitely many elements are to be compared (for example, $\mathbb{N}$ versus $\mathbb{Q}$),
this elementary approach no longer works. In the late nineteenth century, Georg Cantor introduced
the idea of comparing the cardinality of sets based on the nature of the functions that can be defined
from one set to another.


Definition 3.5 (i) Two sets $A$ and $B$ are equicardinal (notation $|A| = |B|$) if there exists a bijective
function from $A$ to $B$.


(ii) $B$ has cardinality greater than or equal to that of $A$ (notation $|B| \ge |A|$) if there exists an injective
function from $A$ to $B$.

(iii) $B$ has cardinality strictly greater than that of $A$ (notation $|B| > |A|$) if there is an injective function,
but no bijective function, from $A$ to $B$.


Having stated the definitions above, we define countability of a set as follows:


Definition 3.6 A set $E$ is said to be countably infinite if $E$ and $\mathbb{N}$ are equicardinal. A set is said
to be countable if it is either finite or countably infinite.


The following are some examples of countable sets: 
1. The set of all integers Z is countably infinite. 
We can define the bijection $f : \mathbb{Z} \to \mathbb{N}$ as follows:


$n = f(z) \in \mathbb{N}$ | $z \in \mathbb{Z}$
1 | 0
2 | +1
3 | -1
4 | +2
5 | -2


The existence of this bijective map from $\mathbb{Z}$ to $\mathbb{N}$ proves that $\mathbb{Z}$ is countably infinite.
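
The table above follows a simple pattern: positive integers go to even positions and non-positive integers to odd positions. The sketch below writes this pattern as a closed-form rule; the formula is inferred here from the five listed rows (with $\mathbb{N}$ starting at 1) and is offered as an illustration rather than as the lecture's own definition.

```python
def f(z):
    """Position of the integer z in the list 0, +1, -1, +2, -2, ..."""
    return 2 * z if z > 0 else 1 - 2 * z

def f_inverse(n):
    """The integer found at position n of the list (n = 1, 2, 3, ...)."""
    return n // 2 if n % 2 == 0 else -((n - 1) // 2)

# Reproduce the rows of the table: z = 0, +1, -1, +2, -2.
print([f(z) for z in (0, 1, -1, 2, -2)])
# Round trip on a finite window of integers.
print(all(f_inverse(f(z)) == z for z in range(-50, 51)))
```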


2. The set of all rationals in [0,1] is countable.
Consider the rational number $p/q$, where $q \ne 0$. Increment $q$ in steps of 1, starting with 1. For each such $q$ and
$0 \le p \le q$, add the rational number $p/q$ to the set if it is not already present. In this way, the set of rational
numbers in [0,1] can be explicitly listed as:
$\{0, 1, \frac{1}{2}, \frac{1}{3}, \frac{2}{3}, \frac{1}{4}, \frac{3}{4}, \frac{1}{5}, \frac{2}{5}, \frac{3}{5}, \frac{4}{5}, \frac{1}{6}, \frac{5}{6}, \frac{1}{7}, \ldots\}$.


Clearly, we can define a bijection from $\mathbb{Q} \cap [0,1] \to \mathbb{N}$ where each rational number is mapped to its index in
the above list. Thus the set of all rational numbers in [0,1] is countably infinite and thus countable.
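
The listing procedure described above translates directly into a generator. The following is a minimal sketch using `fractions.Fraction` for exact arithmetic; the function name is a hypothetical choice, and the `seen` set implements the "if it is not already present" step.

```python
from fractions import Fraction
from itertools import count, islice

def rationals_in_unit_interval():
    """Yield each rational in [0, 1] exactly once, for q = 1, 2, 3, ... and 0 <= p <= q."""
    seen = set()
    for q in count(1):
        for p in range(0, q + 1):
            r = Fraction(p, q)       # automatically reduced to lowest terms
            if r not in seen:        # skip repeats such as 2/4 = 1/2
                seen.add(r)
                yield r

first = list(islice(rationals_in_unit_interval(), 10))
print([str(r) for r in first])
```

The index of a rational in this enumeration is exactly the natural number that the bijection assigns to it.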


3. The set of all rational numbers, $\mathbb{Q}$, is countable.
In order to prove this, we state an important theorem, whose proof can be found in [1].


Theorem 3.7 Let $I$ be a countable index set, and let $E_i$ be countable for each $i \in I$. Then $\bigcup_{i \in I} E_i$ is
countable. More glibly, it can also be stated as follows: A countable union of countable sets is countable.


We will now use this theorem to prove the countability of the set of all rational numbers. It has already been
proved that the set $\mathbb{Q} \cap [0,1]$ is countable. Similarly, it can be shown that $\mathbb{Q} \cap [n, n+1]$ is countable, $\forall\, n \in \mathbb{Z}$.
Let $Q_i = \mathbb{Q} \cap [i, i+1]$. Thus, clearly, the set of all rational numbers, $\mathbb{Q} = \bigcup_{i \in \mathbb{Z}} Q_i$, is a countable union of
countable sets and is therefore countable.


Remark: For two finite sets A and B, we know that if A is a strict subset of B, then B has cardinality 
greater than that of A. As the above examples show, this is not true for infinite sets. Indeed, N is a strict 
subset of Q, but N and Q are equicardinal! 


4. The set of all algebraic numbers (numbers which are roots of polynomial equations with rational
coefficients) is countable.


5. The set of all computable numbers, i.e., real numbers that can be computed to within any desired 
precision by a finite, terminating algorithm, is countable (see Wikipedia article for more details). 




Definition 3.8 A set F is uncountable if it has cardinality strictly greater than the cardinality of N. 


In the spirit of Definition 3.5, this means that F is uncountable if an injective function from N to F exists, 
but no such bijective function exists. 


An interesting example of an uncountable set is the set of all infinite binary strings. The proof of the 
following theorem uses the celebrated ‘diagonal argument’ of Cantor. 


Theorem 3.9 (Cantor): The set of all infinite binary strings, $\{0,1\}^\infty$, is uncountable.


Proof: It is easy to show that an injection from $\mathbb{N}$ to $\{0,1\}^\infty$ exists (exercise: produce one!). We need to
show that no such bijection exists.


Let us assume the contrary, i.e., let us assume that the set of all infinite binary strings, $A = \{0,1\}^\infty$, is countably
infinite. Thus there exists a bijection $f : A \to \mathbb{N}$. In other words, we can order the set of all infinite binary
strings as follows:

$$\begin{array}{cccc}
a_{11} & a_{12} & a_{13} & \cdots \\
a_{21} & a_{22} & a_{23} & \cdots \\
a_{31} & a_{32} & a_{33} & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{array}$$

where $a_{ij}$ is the $j$-th bit of the $i$-th binary string, $i, j \ge 1$.


Consider the infinite binary string given by $\bar{a} = \bar{a}_{11}\bar{a}_{22}\bar{a}_{33}\ldots$, where $\bar{a}_{ij}$ is the complement of the bit $a_{ij}$.


Since our list contains all infinite binary strings, there must exist some $k \in \mathbb{N}$ such that the string $\bar{a}$ occurs
at the $k$-th position in the list, i.e., $f(\bar{a}) = k$. The $k$-th bit of this specific string is $\bar{a}_{kk}$. However, from the
above list, we know that the $k$-th bit of the $k$-th string is $a_{kk}$. Thus, we can conclude that the string $\bar{a}$ cannot
occur in any position $k \ge 1$ in our list, contradicting our initial assumption that our list exhausts all possible
infinite binary strings.


Thus, there cannot possibly exist a bijection from $\mathbb{N}$ to $\{0,1\}^\infty$, proving that $\{0,1\}^\infty$ is uncountable. ∎
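
The diagonal construction can be illustrated on a finite window of a purported list. In the sketch below, `listed` holds the first four bits of four hypothetical strings (invented here for illustration); the function flips the $k$-th bit of the $k$-th string, so the result disagrees with every listed string in at least one position. The actual proof applies the same step to infinitely many infinite strings.

```python
def diagonal_complement(strings):
    """Flip the k-th bit of the k-th string, for k = 0, ..., len(strings) - 1."""
    return "".join("1" if s[k] == "0" else "0" for k, s in enumerate(strings))

# Hypothetical finite window: the first 4 bits of the first 4 listed strings.
listed = ["0000", "0101", "1111", "1010"]
a_bar = diagonal_complement(listed)

print(a_bar)
# a_bar differs from the k-th string in the k-th bit, so it is not in the list.
print(all(a_bar[k] != s[k] for k, s in enumerate(listed)))
```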


Now using Cantor’s theorem, we will prove that the set of irrational numbers is uncountable. 
Claim 3.10 The sets $[0,1]$, $\mathbb{R}$ and $\{\mathbb{R} \setminus \mathbb{Q}\}$ are uncountable.


Proof: Firstly, consider the set $[0,1]$. Any number in this set can be expressed by its binary equivalent
and thus, there appears to be a bijection from $[0,1] \to \{0,1\}^\infty$. However, this is not exactly a bijection,
as there is a problem with the dyadic rationals (i.e., numbers of the form $\frac{a}{2^b}$, where $a$ and $b$ are natural
numbers, and $a$ is odd). For example, 0.01000... in binary is the same as 0.001111.... However, we can
tweak this “near bijection” to produce an explicit bijection in the following way. For any infinite binary
string $x = (x_1, x_2, \ldots) \in \{0,1\}^\infty$, let

$$g(x) = \sum_{k=1}^{\infty} x_k 2^{-k}.$$


The function $g$ maps $\{0,1\}^\infty$ “almost bijectively” to $[0,1]$, but unfortunately, the dyadic rationals have two
pre-images. For example, we have $g(1000\ldots) = g(0111\ldots) = \frac{1}{2}$. To fix this, let the set of dyadic rationals
be given by the list

$$D = \{d_i\}_{i=1}^{\infty} = \left\{\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{3}{4}, \tfrac{1}{8}, \tfrac{3}{8}, \tfrac{5}{8}, \tfrac{7}{8}, \ldots\right\}.$$

Note that the dyadic rationals can be put in a list as given above, as they are countable. Next, we define the
following bijection $f(x)$ from $\{0,1\}^\infty$ to $[0,1]$:

$$f(x) = \begin{cases}
g(x) & \text{if } g(x) \notin D, \\
d_{2n-1} & \text{if } g(x) = d_n \text{ for some } n \in \mathbb{N} \text{ and } x \text{ terminates in 1}, \\
d_{2n} & \text{if } g(x) = d_n \text{ for some } n \in \mathbb{N} \text{ and } x \text{ terminates in 0}.
\end{cases}$$


This is an explicit bijection from $\{0,1\}^\infty$ to $[0,1]$, which proves that the set $[0,1]$ is uncountable. (Why?)


Next, we can define a bijection from $(0,1) \to \mathbb{R}$, for instance using the function $\tan(\pi x - \frac{\pi}{2})$, $x \in (0,1)$.
Thus the set of all real numbers, $\mathbb{R}$, is uncountable.


Finally, we can write $\mathbb{R} = \mathbb{Q} \cup \{\mathbb{R} \setminus \mathbb{Q}\}$. Since $\mathbb{Q}$ is countable and $\mathbb{R}$ is uncountable, we can easily argue that
$\{\mathbb{R} \setminus \mathbb{Q}\}$, i.e., the set of all irrational numbers, is uncountable. ∎


3.3 Exercises 


1. Prove that $2^{\mathbb{N}}$, the power set of the natural numbers, is uncountable. (Hint: Try to associate an infinite
binary string with each subset of $\mathbb{N}$.)


2. Prove that the Cartesian product of two countable sets is countable.


3. Let $A$ be a countable set, and $B_n$ be the set of all $n$-tuples $(a_1, \ldots, a_n)$, where $a_k \in A$ $(k = 1, 2, \ldots, n)$
and the elements $a_1, a_2, \ldots, a_n$ need not be distinct. Show that $B_n$ is countable.


4. Show that an infinite subset of a countable set is countable.


5. A number is said to be an algebraic number if it is a root of some polynomial equation with integer
coefficients. For example, $\sqrt{2}$ is algebraic since it is a root of the polynomial $x^2 - 2$. However, it is known
that $\pi$ is not algebraic. Show that the set of all algebraic numbers is countable. Also, a transcendental
number is a real number that is not algebraic. Are the transcendental numbers countable?


6. The Cantor set is an interesting subset of [0, 1], which we will encounter several times in this course. 
One way to define the Cantor set C is as follows. Consider the set of all real numbers in [0,1] written down 
in ternary (base-3) expansion, instead of the usual decimal (base-10) expansion. A real number x € [0,1] 
belongs to C iff x admits a ternary expansion without any 1s. Show that C is uncountably infinite, and that 
it is indeed equi-cardinal with [0, 1]. 
Lecture 4: Probability Spaces
Lecturer: Dr. Krishna Jagannathan Scribe: Jainam Doshi, Arjun Nadh and Ajay M 


4.1 Introduction 


Just as a point is not defined in elementary geometry, probability theory begins with two entities that are 
not defined. These undefined entities are a Random Experiment and its Outcome. These two concepts 
are to be understood intuitively, as suggested by their respective English meanings. We use these undefined 
terms to define other entities. 


Definition 4.1 The Sample Space $\Omega$ of a random experiment is the set of all possible outcomes of the
random experiment.


An outcome (or elementary outcome) of the random experiment is usually denoted by $\omega$. Thus, when a
random experiment is performed, the outcome $\omega \in \Omega$ is picked by the Goddess of Chance or Mother Nature
or your favourite genie.


Note that the sample space $\Omega$ can be finite or infinite. Indeed, depending on the cardinality of $\Omega$, it can be
classified as follows:


1. Finite sample space 
2. Countably infinite sample space 


3. Uncountable sample space 


It is imperative to note that for a given random experiment, its sample space is defined depending on what 
one is interested in observing as the outcome. We illustrate this using an example. Consider a person tossing 
a coin. This is a random experiment. Now consider the following three cases: 


• Suppose one is interested in knowing whether the toss produces a head or a tail; then the sample space
is given by $\Omega = \{H, T\}$. Here, as there are only two possible outcomes, the sample space is said to be
finite.


• Suppose one is interested in the number of tumbles before the coin hits the ground; then the sample
space is the set of all natural numbers. In this case, the sample space is countably infinite and is given
by $\Omega = \mathbb{N}$.


• Suppose one is interested in the speed with which the coin strikes the ground; then the set of positive real
numbers forms the sample space. This is an example of an uncountable sample space, which is given
by $\Omega = \mathbb{R}^+$.

Thus we see that for the same experiment, $\Omega$ can be different based on what the experimenter is interested
in.


Let us now have a look at one more example where the sample space can be different for the same experiment 
and you can get different answers based on which sample space you decide to choose. 


Bertrand’s Paradox: Consider a circle of radius r. What is the probability that the length of a chord chosen 
at random is greater than the length of the side of an equilateral triangle inscribed in the circle? 


This is an interesting paradox and gives different answers based on different sample spaces. The entire 
description of Bertrand’s Paradox can be found in [1]. 


Definition 4.2 (Informal) An event is a subset of the sample space, to which probabilities will be assigned. 


An event is a subset of the sample space, but we emphasise that not all subsets of the sample space are
necessarily considered events, for reasons that will be explained later. Until we are ready to give a more
precise definition, we can consider events to be those “interesting” subsets of $\Omega$ to which we will eventually
assign probabilities. We will see later that whenever $\Omega$ is finite or countable, all subsets of the sample space
can be considered as events and be assigned probabilities. However, when $\Omega$ is uncountable, it is often not
possible to assign probabilities to all subsets of $\Omega$, for reasons that cannot be made clear at this point. The way to handle
uncountable sample spaces will be discussed later.


Definition 4.3 An event $A$ is said to occur if the outcome $\omega$ of the random experiment is an element of $A$, i.e., if $\omega \in A$.


Let us take an example. Say the random experiment is choosing a card at random from a pack of playing cards. What is the sample space in this case? It is a 52-element set, as each card is a possible outcome. As the sample space is finite, any subset of the sample space can be considered an event. As a result, there are $2^{52}$ events (the power set of an $n$-element set has $2^n$ elements). An event can be any subset of the sample space, including the empty set, the singleton sets (each containing one outcome) and collections of more than one outcome. Listed below are a few events:

• The 7 of Hearts (1 element)

• A face card (12 elements)

• A 2 and a 7 at the same time (0 elements)

• An ace of any suit (4 elements)

• A diamond card (13 elements)

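The events listed above can be written down as explicit subsets of a 52-element sample space. The following is a minimal sketch; the encoding of a card as a (rank, suit) pair and all variable names are assumptions made for illustration.

```python
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]

# Sample space: every card is one elementary outcome.
OMEGA = frozenset(product(RANKS, SUITS))

# Events are subsets of OMEGA.
seven_of_hearts = {c for c in OMEGA if c == ("7", "hearts")}
face_card = {c for c in OMEGA if c[0] in {"J", "Q", "K"}}
two_and_seven = {c for c in OMEGA if c[0] == "2" and c[0] == "7"}  # impossible
ace = {c for c in OMEGA if c[0] == "A"}
diamond = {c for c in OMEGA if c[1] == "diamonds"}


def occurs(event, outcome):
    """An event occurs if the realised outcome is one of its elements."""
    return outcome in event


number_of_events = 2 ** len(OMEGA)  # size of the power set of OMEGA
```

For instance, if the card drawn is the king of diamonds, then both `face_card` and `diamond` occur, while `ace` does not.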
Next, let us look at some nice properties that we would expect events to satisfy: 


• Since the sample space $\Omega$ always occurs, we would like to have $\Omega$ as an event.

• If $A$ is an event (i.e., a “nice” subset of the sample space to which we would like to assign a probability), it is reasonable to expect $A^c$ to be an event as well.

• If $A$ and $B$ are two events, we are interested in the occurrence of at least one of them ($A$ or $B$) as well as the occurrence of both of them ($A$ and $B$). Hence, we would like to have $A \cup B$ and $A \cap B$ to be events as well.

The above three properties motivate a mathematical structure of subsets, known as an algebra.

4.2 Algebra, $\mathcal{F}_0$


Let $\Omega$ be the sample space and let $\mathcal{F}_0$ be a collection of subsets of $\Omega$. Then, $\mathcal{F}_0$ is said to be an algebra (or a field) if

i. $\emptyset \in \mathcal{F}_0$.
ii. $A \in \mathcal{F}_0$ implies $A^c \in \mathcal{F}_0$.
iii. $A \in \mathcal{F}_0$ and $B \in \mathcal{F}_0$ implies $A \cup B \in \mathcal{F}_0$.

It can be shown that an algebra is closed under finite union and finite intersection (see Exercise 1(a)).
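On a finite sample space, the three defining conditions can be checked mechanically. The sketch below is one way to do so; the function name `is_algebra` and the example collections are assumptions made for illustration.

```python
from itertools import combinations


def is_algebra(omega, collection):
    """Check conditions (i)-(iii) for a collection of subsets of a finite omega."""
    omega = frozenset(omega)
    sets = {frozenset(s) for s in collection}
    if any(not s <= omega for s in sets):
        return False  # every member must be a subset of omega
    if frozenset() not in sets:
        return False  # (i) the empty set belongs to the collection
    if any(omega - a not in sets for a in sets):
        return False  # (ii) closed under complement
    if any(a | b not in sets for a, b in combinations(sets, 2)):
        return False  # (iii) closed under pairwise union
    return True


omega = {1, 2, 3, 4}
good = [set(), {1, 2}, {3, 4}, omega]
bad = [set(), {1}, omega]  # the complement {2, 3, 4} is missing

# Closure under intersection follows from (ii) and (iii) via De Morgan's laws.
good_sets = {frozenset(s) for s in good}
closed_under_intersection = all(a & b in good_sets for a in good_sets for b in good_sets)
```

Here `is_algebra(omega, good)` holds while `is_algebra(omega, bad)` does not, and the last line confirms that the valid example is also closed under intersection.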


However, a natural question that arises at this point is: “Is the structure of an algebra enough to study events of typical interest?” Consider the following example.


Example:- Toss a coin repeatedly until the first head shows. Here, $\Omega = \{H, TH, TTH, \dots\}$. Let us say that we are interested in determining whether the number of tosses before seeing a head is even. This ‘event’ of interest need not be included in an algebra. This is because an algebra is guaranteed to be closed only under finite unions, whereas the ‘event’ of interest entails a countably infinite union. This motivates the definition of a σ-algebra.
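To make the countably infinite union explicit, the event can be written out as follows. This is a sketch that reads “the number of tosses before seeing a head” as the number of tails preceding the first head (an assumption about the intended meaning); the symbol $E$ is introduced here only for illustration.

```latex
E \;=\; \{H,\; TTH,\; TTTTH,\; \dots\}
  \;=\; \bigcup_{k=0}^{\infty} \bigl\{\, \underbrace{T \cdots T}_{2k} H \,\bigr\}
```

Each set in the union is a singleton, so $E$ is a countable union of singletons, but it is not a finite union of them.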


4.3 Sigma Algebra, $\mathcal{F}$


A collection $\mathcal{F}$ of subsets of $\Omega$ is called a σ-algebra (or σ-field) if

i. $\emptyset \in \mathcal{F}$.
ii. $A \in \mathcal{F}$ implies $A^c \in \mathcal{F}$.
iii. If $A_1, A_2, A_3, \dots$ is a countable collection of subsets in $\mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.

Note that unlike an algebra, a σ-algebra is closed under countable union and countable intersection (see Exercise 1(b)). Some examples of σ-algebras are:

i. $\{\emptyset, \Omega\}$
ii. $\{\emptyset, A, A^c, \Omega\}$
iii. The power set of $\Omega$, denoted by $2^{\Omega}$.

The 2-tuple $(\Omega, \mathcal{F})$ is called a measurable space. Also, every member of the σ-algebra $\mathcal{F}$ is called an $\mathcal{F}$-measurable set in the context of measure theory. In the specific context of probability theory, $\mathcal{F}$-measurable sets are called events. Thus, whether or not a subset of $\Omega$ is considered an event depends on the σ-algebra that is under consideration.


4.4 Measure 


We now proceed to define measures and measure spaces. We will see that a probability space is indeed a 
special case of a measure space. 
Definition 4.4 Let $(\Omega, \mathcal{F})$ be a measurable space. A measure on $(\Omega, \mathcal{F})$ is a function $\mu : \mathcal{F} \to [0, \infty]$ such that

i. $\mu(\emptyset) = 0$.

ii. If $\{A_i, i \ge 1\}$ is a sequence of disjoint sets in $\mathcal{F}$, then the measure of the union (of countably infinite disjoint sets) is equal to the sum of measures of individual sets, i.e.,

$$\mu\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i). \qquad (4.1)$$

The second property stated above is known as the countable additivity property of measures. From the definition, it is clear that a measure can only be assigned to elements of $\mathcal{F}$. The triplet $(\Omega, \mathcal{F}, \mu)$ is called a measure space. $\mu$ is said to be a finite measure if $\mu(\Omega) < \infty$; otherwise, $\mu$ is said to be an infinite measure. In particular, if $\mu(\Omega) = 1$, then $\mu$ is said to be a probability measure. Next, we state this explicitly for pedagogical completeness.


4.5 Probability Measure 


A probability measure is a function $P : \mathcal{F} \to [0, 1]$ such that

i. $P(\emptyset) = 0$.
ii. $P(\Omega) = 1$.
iii. (Countable additivity:) If $\{A_i, i \ge 1\}$ is a sequence of disjoint sets in $\mathcal{F}$, then

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$

The triplet $(\Omega, \mathcal{F}, P)$ is called a probability space, and the three properties stated above are sometimes referred to as the axioms of probability.


Note:- It is clear from the definition that probabilities are assigned only to elements of $\mathcal{F}$, and not necessarily to all subsets of $\Omega$. In other words, probabilities are assigned only to events. Even when we speak of the probability of an elementary outcome $\omega$, it should be interpreted as the probability assigned to the singleton set $\{\omega\}$ (assuming, of course, that the singleton is an event).
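For a small finite example, the axioms can be verified by brute force. The sketch below builds a fair-die probability space with the power set as the σ-algebra; the helper names are hypothetical, and the check relies on the fact that on a finite sample space countable additivity reduces to additivity over pairs of disjoint events.

```python
from fractions import Fraction
from itertools import combinations


def power_set(omega):
    """All subsets of a finite set, as frozensets."""
    items = sorted(omega)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def make_measure(weights):
    """Return P(event) = sum of the weights of the outcomes in the event."""
    def P(event):
        return sum((weights[w] for w in event), Fraction(0))
    return P


def satisfies_axioms(P, omega, events):
    """Check P(empty) = 0, P(omega) = 1 and additivity over disjoint pairs."""
    if P(frozenset()) != 0 or P(frozenset(omega)) != 1:
        return False
    for a in events:
        for b in events:
            if not (a & b) and P(a | b) != P(a) + P(b):
                return False
    return True


weights = {w: Fraction(1, 6) for w in range(1, 7)}  # fair six-sided die
OMEGA = frozenset(weights)
F = power_set(OMEGA)  # here every subset is an event
P = make_measure(weights)
```

With these definitions, `satisfies_axioms(P, OMEGA, F)` holds, and the probability of a single outcome such as 3 is obtained as `P(frozenset({3}))`, that is, as the probability of the singleton event.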


4.6 Exercises 


1. a) Let $A_1, A_2, \dots, A_n$ be a finite collection of subsets of $\Omega$ such that $A_i \in \mathcal{F}_0$ (an algebra), $1 \le i \le n$. Show that $\bigcup_{i=1}^{n} A_i \in \mathcal{F}_0$ and $\bigcap_{i=1}^{n} A_i \in \mathcal{F}_0$. Hence, infer that an algebra is closed under finite union and finite intersection.

b) Suppose $A_1, A_2, A_3, \dots$ is a countable collection of subsets in the σ-algebra $\mathcal{F}$. Show that $\bigcap_{i=1}^{\infty} A_i \in \mathcal{F}$.

2. [σ-algebra: Properties and Construction]

(a) Show that a σ-algebra is also an algebra.

(b) Given a sample space $\Omega$ and a σ-algebra $\mathcal{F}$ of subsets of $\Omega$, show that if $A, B \in \mathcal{F}$, then $A \setminus B$ and $A \,\Delta\, B$ (the symmetric difference of $A$ and $B$) are in $\mathcal{F}$.

(c) Consider the random experiment of throwing a die. If a statistician is interested in the occurrence of either an odd or an even outcome, construct a sample space and a σ-algebra of subsets of this sample space.

(d) Let $A_1, A_2, \dots, A_n$ be arbitrary subsets of $\Omega$. Describe (explicitly) the smallest σ-algebra $\mathcal{F}$ containing $A_1, A_2, \dots, A_n$. How many sets are there in $\mathcal{F}$? (Give an upper bound that is attainable under certain conditions.) List all the sets in $\mathcal{F}$ for $n = 2$.


3. Let $\mathcal{F}$ and $\mathcal{G}$ be two σ-algebras of subsets of $\Omega$.

(a) Is $\mathcal{F} \cup \mathcal{G}$, the collection of subsets of $\Omega$ lying in either $\mathcal{F}$ or $\mathcal{G}$, a σ-algebra?

(b) Show that $\mathcal{F} \cap \mathcal{G}$, the collection of subsets of $\Omega$ lying in both $\mathcal{F}$ and $\mathcal{G}$, is a σ-algebra.

(c) Generalize (b) to arbitrary intersections as follows. Let $\mathcal{I}$ be an arbitrary index set (possibly uncountable), and let $\{\mathcal{F}_i\}_{i \in \mathcal{I}}$ be a collection of σ-algebras on $\Omega$. Show that $\bigcap_{i \in \mathcal{I}} \mathcal{F}_i$ is also a σ-algebra.


4. Let $\mathcal{F}$ be a σ-algebra of subsets of $\Omega$, and let $B \in \mathcal{F}$. Show that
$$\mathcal{G} = \{A \cap B \mid A \in \mathcal{F}\}$$
is a σ-algebra of subsets of $B$.


5. Let $X$ and $Y$ be two sets and let $f : X \to Y$ be a function. If $\mathcal{F}$ is a σ-algebra over the subsets of $Y$ and $\mathcal{G} = \{A \mid \exists\, B \in \mathcal{F} \text{ such that } f^{-1}(B) = A\}$, does $\mathcal{G}$ form a σ-algebra of subsets of $X$?
Note that $f^{-1}(N)$ is the notation used for the pre-image of a set $N \subseteq Y$ under the function $f$. That is, $f^{-1}(N) = \{x \in X \mid f(x) \in N\}$.


6. Let $\Omega$ be an arbitrary set.

(a) Is the collection $\mathcal{F}_1$, consisting of all finite subsets of $\Omega$, an algebra?

(b) Let $\mathcal{F}_2$ consist of all finite subsets of $\Omega$, and all subsets of $\Omega$ having a finite complement. Is $\mathcal{F}_2$ an algebra?

(c) Is $\mathcal{F}_2$ a σ-algebra?

(d) Let $\mathcal{F}_3$ consist of all countable subsets of $\Omega$, and all subsets of $\Omega$ having a countable complement. Is $\mathcal{F}_3$ a σ-algebra?

Lecture 5: Properties of Probability Measures 
Lecturer: Dr. Krishna Jagannathan Scribe: Ajay M, Gopal Krishna Kamath M 


5.1 Properties 


In this lecture, we will derive some fundamental properties of probability measures, which follow directly from the axioms of probability. In what follows, $(\Omega, \mathcal{F}, P)$ is a probability space.


• Property 1:- Suppose $A$ is a subset of $\Omega$ such that $A \in \mathcal{F}$. Then,
$$P(A^c) = 1 - P(A). \qquad (5.1)$$

Proof:- Given any subset $A \subseteq \Omega$, $A$ and $A^c$ partition the sample space. Hence, $A^c \cup A = \Omega$ and $A^c \cap A = \emptyset$. By the countable additivity axiom of probability (applied to the sequence $A, A^c, \emptyset, \emptyset, \dots$), $P(A^c \cup A) = P(A) + P(A^c)$, so $P(\Omega) = P(A) + P(A^c)$ and therefore $P(A^c) = 1 - P(A)$.


• Property 2:- Consider events $A$ and $B$ such that $A \subseteq B$ and $A, B \in \mathcal{F}$. Then $P(A) \le P(B)$.

Proof:- The set $B$ can be written as the union of two disjoint sets $A$ and $A^c \cap B$. Therefore, we have $P(A) + P(A^c \cap B) = P(B)$, which gives $P(A) \le P(B)$ since $P(A^c \cap B) \ge 0$.


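• Property 3:- (Finite Additivity) If $A_1, A_2, \dots, A_n$ are a finite number of disjoint events, then

$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i). \qquad (5.2)$$

Proof:- This property follows directly from the axiom of countable additivity of probability measures. It is obtained by setting the events $A_{n+1}, A_{n+2}, \dots$ as empty sets. The LHS simplifies as:

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = P\left(\bigcup_{i=1}^{n} A_i\right).$$

The RHS can be manipulated as follows:

$$\begin{aligned}
\sum_{i=1}^{\infty} P(A_i) &\overset{(a)}{=} \lim_{k \to \infty} \sum_{i=1}^{k} P(A_i) \\
&= \sum_{i=1}^{n} P(A_i) + \lim_{k \to \infty} \sum_{i=n+1}^{k} P(A_i) \\
&\overset{(b)}{=} \sum_{i=1}^{n} P(A_i) + \lim_{k \to \infty} 0 \\
&= \sum_{i=1}^{n} P(A_i),
\end{aligned}$$

where (a) follows from the definition of an infinite series and (b) is a consequence of setting the events from $A_{n+1}$ onwards to null sets.
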
• Property 4:- For any $A, B \in \mathcal{F}$,
$$P(A \cup B) = P(A) + P(B) - P(A \cap B). \qquad (5.3)$$

In general, for a family of events $\{A_i\}_{i=1}^{n} \subset \mathcal{F}$,

$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) + \cdots + (-1)^{n+1} P\left(\bigcap_{i=1}^{n} A_i\right). \qquad (5.4)$$

This property is proved using induction on $n$. It can be proved in a much simpler way using the concept of indicator random variables, which will be discussed in subsequent lectures.

Proof of Eq (5.3):- The set $A \cup B$ can be written as $A \cup B = A \cup (A^c \cap B)$. Since $A$ and $A^c \cap B$ are disjoint events, $P(A \cup B) = P(A) + P(A^c \cap B)$. Now, the set $B$ can be partitioned as $B = (A \cap B) \cup (A^c \cap B)$. Hence, $P(B) = P(A \cap B) + P(A^c \cap B)$. On substituting this result in the expression for $P(A \cup B)$, we obtain the final result $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
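Equations (5.3) and (5.4) can be checked numerically on a small example. The sketch below uses a uniform measure on twelve outcomes; the sample space, the three events and the function names are assumptions chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = frozenset(range(1, 13))  # twelve equally likely outcomes


def P(event):
    """Uniform probability measure on OMEGA."""
    return Fraction(len(event), len(OMEGA))


def inclusion_exclusion(events):
    """Right-hand side of (5.4): alternating sum over all r-fold intersections."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = 1 if r % 2 == 1 else -1
        for group in combinations(events, r):
            total += sign * P(frozenset.intersection(*group))
    return total


A = frozenset(w for w in OMEGA if w % 2 == 0)  # even outcomes
B = frozenset(w for w in OMEGA if w % 3 == 0)  # multiples of 3
C = frozenset(w for w in OMEGA if w > 8)       # outcomes larger than 8

lhs = P(A | B | C)
rhs = inclusion_exclusion([A, B, C])
```

For these events both sides equal 3/4: the singles contribute 14/12, the pairwise intersections remove 6/12, and the triple intersection adds back 1/12.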


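• Property 5:- If $\{A_i, i \ge 1\}$ are events, then

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \lim_{m \to \infty} P\left(\bigcup_{i=1}^{m} A_i\right). \qquad (5.5)$$

This result is known as continuity of probability measures.

Proof:- Define a new family of sets $B_1 = A_1$, $B_2 = A_2 \setminus A_1$, $\dots$, $B_n = A_n \setminus \bigcup_{i=1}^{n-1} A_i$, $\dots$

Then, the following claims are placed:

Claim 1:- $B_i \cap B_j = \emptyset$, $\forall i \ne j$.

Claim 2:- $\bigcup_{i=1}^{\infty} B_i = \bigcup_{i=1}^{\infty} A_i$.

Since $\{B_i, i \ge 1\}$ is a disjoint sequence of events, and using the above claims, we get

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = P\left(\bigcup_{i=1}^{\infty} B_i\right) = \sum_{i=1}^{\infty} P(B_i).$$

Therefore,

$$\begin{aligned}
P\left(\bigcup_{i=1}^{\infty} A_i\right) &= \sum_{i=1}^{\infty} P(B_i) \\
&\overset{(a)}{=} \lim_{m \to \infty} \sum_{i=1}^{m} P(B_i) \\
&\overset{(b)}{=} \lim_{m \to \infty} P\left(\bigcup_{i=1}^{m} B_i\right) \\
&\overset{(c)}{=} \lim_{m \to \infty} P\left(\bigcup_{i=1}^{m} A_i\right).
\end{aligned}$$

Here, (a) follows from the definition of an infinite series, (b) follows from Claim 1 in conjunction with the countable additivity axiom of probability measures, and (c) follows from the intermediate result required to prove Claim 2, namely $\bigcup_{i=1}^{m} B_i = \bigcup_{i=1}^{m} A_i$.

Hence proved.
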
• Property 6:- If $\{A_i, i \ge 1\}$ is a sequence of increasing nested events, i.e., $A_i \subseteq A_{i+1}$, $\forall i \ge 1$, then
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \lim_{m \to \infty} P(A_m). \qquad (5.6)$$

• Property 7:- If $\{A_i, i \ge 1\}$ is a sequence of decreasing nested events, i.e., $A_{i+1} \subseteq A_i$, $\forall i \ge 1$, then
$$P\left(\bigcap_{i=1}^{\infty} A_i\right) = \lim_{m \to \infty} P(A_m). \qquad (5.7)$$

Properties 6 and 7 are corollaries of Property 5.


• Property 8:- Suppose $\{A_i, i \ge 1\}$ are events. Then

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) \le \sum_{i=1}^{\infty} P(A_i). \qquad (5.8)$$

This result is known as the Union Bound. The bound is trivial if $\sum_{i=1}^{\infty} P(A_i) \ge 1$, since the LHS of (5.8) is the probability of some event. This is a very widely used bound, and has several applications. For instance, the union bound is used in the probability of error analysis in Digital Communications for complicated modulation schemes.
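A quick numerical check shows when the union bound is loose and when it holds with equality. The following is a minimal sketch on a uniform sample space of twenty outcomes; the events and the helper name `union_bound_sides` are assumptions made for illustration.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 21))  # twenty equally likely outcomes


def P(event):
    """Uniform probability measure on OMEGA."""
    return Fraction(len(event), len(OMEGA))


def union_bound_sides(events):
    """Return (P(union of events), sum of P(event)) for a finite list of events."""
    union = frozenset().union(*events)
    return P(union), sum((P(a) for a in events), Fraction(0))


# Overlapping events: multiples of 2, 3 and 5.
overlapping = [frozenset(w for w in OMEGA if w % k == 0) for k in (2, 3, 5)]
# Pairwise disjoint events.
disjoint = [frozenset({1, 2}), frozenset({3}), frozenset({4, 5, 6})]

lhs_overlap, rhs_overlap = union_bound_sides(overlapping)
lhs_disjoint, rhs_disjoint = union_bound_sides(disjoint)
```

For the overlapping events the union has probability 7/10 while the bound evaluates to 1, which is the trivial case mentioned above. For the disjoint events both sides equal 3/10, in agreement with finite additivity.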


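Proof:- Define a new family of sets $B_1 = A_1$, $B_2 = A_2 \setminus A_1$, $\dots$, $B_n = A_n \setminus \bigcup_{i=1}^{n-1} A_i$, $\dots$

Claim 1:- $B_i \cap B_j = \emptyset$, $\forall i \ne j$.

Claim 2:- $\bigcup_{i=1}^{\infty} B_i = \bigcup_{i=1}^{\infty} A_i$.

Since $\{B_i, i \ge 1\}$ is a disjoint sequence of events, and using the above claims, we get

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = P\left(\bigcup_{i=1}^{\infty} B_i\right) = \sum_{i=1}^{\infty} P(B_i).$$

Also, since $B_i \subseteq A_i$ $\forall i \ge 1$, we have $P(B_i) \le P(A_i)$ $\forall i \ge 1$ (using Property 2). Therefore, the finite sums of probabilities satisfy

$$\sum_{i=1}^{n} P(B_i) \le \sum_{i=1}^{n} P(A_i).$$

In the limit, the following holds:

$$\sum_{i=1}^{\infty} P(B_i) \le \sum_{i=1}^{\infty} P(A_i).$$

Finally, we arrive at the result

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) \le \sum_{i=1}^{\infty} P(A_i).$$
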


5.2 Exercises 


1. a) Prove Claim 1 and Claim 2 stated in Property 5.

b) Prove Properties 6 and 7, which are corollaries of Property 5.


2. A standard card deck (52 cards) is distributed to two persons: 26 cards to each person. All partitions are equally likely. Find the probability that the first person receives all four aces.


3. Consider two events $A$ and $B$ such that $P(A) > 1 - \delta$ and $P(B) > 1 - \delta$, for some very small $\delta > 0$. Prove that $P(A \cap B)$ is close to 1.


4. [Grimmett] Given events $A_1, A_2, \dots, A_n$, prove that

$$P\left(\bigcup_{1 \le r \le n} A_r\right) \le \min_{1 \le k \le n} \left( \sum_{1 \le r \le n} P(A_r) - \sum_{r : r \ne k} P(A_r \cap A_k) \right).$$


5. Consider a measurable space $(\Omega, \mathcal{F})$ with $\Omega = [0, 1]$. A measure $P$ is defined on the non-empty subsets of $\Omega$ (in $\mathcal{F}$), which are all of the form $(a, b)$, $(a, b]$, $[a, b)$ and $[a, b]$, as the length of the interval, i.e.,
$$P((a, b)) = P((a, b]) = P([a, b)) = P([a, b]) = b - a.$$
a) Show that $P$ is not just a measure, but also a probability measure.


b) Let A, = l= 1] and Bn = (0, sal: for n > 1. Compute P(U;en Aj), P(Nien Ai), P(UienB;) and 


P(NienBi). 
c) Compute P(Nien( B$ U A$)). 
d) Let Cm = [0, +] such that P(Cm) = P(An). Express m in terms of n. 
e) Evaluate P(Nien(Ci N 4:)) and P(Ujen(Ci N A;)). 


6. [Grimmett] You are given that at least one of the events $A_n$, $1 \le n \le N$, is certain to occur. However, certainly no more than two occur. If $P(A_n) = p$ and $P(A_n \cap A_m) = q$, $m \ne n$, then show that $p \ge \frac{1}{N}$ and $q \le \frac{2}{N}$.
Lecture 6: Discrete Probability Spaces 


Lecturer: Dr. Krishna Jagannathan Scribe: Ravi Kolla 


6.1 Discrete Probability Spaces 


In this lecture, we discuss discrete probability spaces. This corresponds to the case when the sample space $\Omega$ is countable. This is the most conceptually straightforward case, since it is possible to assign probabilities to all subsets of $\Omega$.


Definition 6.1 A probability space $(\Omega, \mathcal{F}, P)$ is said to be a discrete probability space if the following conditions hold:

(a) The sample space $\Omega$ is finite or countably infinite,

(b) The σ-algebra is the set of all subsets of $\Omega$, i.e., $\mathcal{F} = 2^{\Omega}$, and

(c) The probability measure, $P$, is defined for every subset of $\Omega$. In particular, it can be defined in terms of the probabilities $P(\{\omega\})$ of the singletons corresponding to each of the elementary outcomes $\omega$, and satisfies, for every $A \in \mathcal{F}$,

$$P(A) = \sum_{\omega \in A} P(\{\omega\}),$$

and

$$\sum_{\omega \in \Omega} P(\{\omega\}) = 1.$$

6.1.1 Examples of Discrete Probability Space 


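1. Let us consider a coin toss experiment with the probability of getting a head as $p$ and the probability of getting a tail as $(1 - p)$. Then, the sample space and the σ-algebra are

$$\Omega = \{H, T\} = \{0, 1\}, \qquad \mathcal{F} = 2^{\Omega} = \{\emptyset, \{H\}, \{T\}, \Omega\},$$

respectively. The probability measure is

$$P(\{H\}) = P(\{0\}) = p, \qquad P(\{T\}) = P(\{1\}) = 1 - p.$$

In this case, we say that $P(\cdot)$ is a Bernoulli measure on $(\{0, 1\}, 2^{\{0,1\}})$.


2. Let $\Omega = \mathbb{N}$, $\mathcal{F} = 2^{\mathbb{N}}$. Then, we can define the probability of a singleton as
$$P(\{k\}) = a_k \ge 0, \quad k \in \mathbb{N},$$
under the constraint that

$$\sum_{k \in \mathbb{N}} P(\{k\}) = 1.$$

For example, $a_k = \frac{1}{2^k}$, $k \in \mathbb{N}$, is a valid measure, since

$$\sum_{k \in \mathbb{N}} \frac{1}{2^k} = 1.$$

As another example, consider $a_k = (1 - p)^{k-1} p$, $0 < p < 1$, $k \in \mathbb{N}$. This is known as a geometric measure with parameter $p$. It is a valid probability measure since

$$\sum_{k \in \mathbb{N}} (1 - p)^{k-1} p = 1.$$


3. Let $\Omega = \mathbb{N} \cup \{0\}$, $\mathcal{F} = 2^{\Omega}$. Let us define

$$P(\{k\}) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad \lambda > 0.$$

This probability measure is called a Poisson measure with parameter $\lambda$ on $(\Omega, 2^{\Omega})$. This is a valid probability measure, since

$$\sum_{k=0}^{\infty} P(\{k\}) = \sum_{k=0}^{\infty} \frac{e^{-\lambda} \lambda^k}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{-\lambda} e^{\lambda} = 1.$$


4. Let $\Omega = \{0, 1, 2, \dots, N\}$, $N \in \mathbb{N}$, $\mathcal{F} = 2^{\Omega}$. Let us define

$$P(\{k\}) = \binom{N}{k} p^k (1 - p)^{N-k}, \quad 0 < p < 1.$$

This probability measure is called a Binomial measure with parameters $(N, p)$ on $(\Omega, 2^{\Omega})$. This can be verified to be a valid probability measure as follows:

$$\sum_{k \in \Omega} \binom{N}{k} p^k (1 - p)^{N-k} = (p + 1 - p)^N = 1.$$
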


Note that in all the examples above, we have not explicitly specified an expression for $P(A)$ for every $A \subseteq \Omega$. Since the sample space is countable, the probability of any subset of the sample space can be obtained as the sum of probabilities of the corresponding elementary outcomes. In other words, for discrete probability spaces, it suffices to specify the probabilities of singletons corresponding to each of the elementary outcomes.
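The examples above can be explored numerically by storing only the singleton probabilities and summing them over an event. The sketch below does this for the Bernoulli, geometric, Poisson and Binomial measures; all function and variable names are assumptions made for illustration, and the infinite sums are truncated.

```python
import math
from fractions import Fraction


def bernoulli_measure(p):
    """Singleton probabilities on {0, 1}, with 0 standing for heads as above."""
    return {0: p, 1: 1 - p}


def binomial_measure(n, p):
    """Singleton probabilities on {0, 1, ..., n}."""
    return {k: math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}


def geometric_weight(p, k):
    """P({k}) for the geometric measure, k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p


def poisson_weight(lam, k):
    """P({k}) for the Poisson measure, k = 0, 1, 2, ..."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def prob(weights, event):
    """P(A) as the sum of singleton probabilities over the outcomes in A."""
    return sum(weights[w] for w in event)


p = Fraction(1, 3)
binom = binomial_measure(10, p)
total_binomial = sum(binom.values())                                     # exact
geometric_partial = sum(geometric_weight(p, k) for k in range(1, 51))    # first 50 terms
poisson_partial = sum(poisson_weight(2.5, k) for k in range(60))         # first 60 terms
even_count_prob = prob(binom, [k for k in binom if k % 2 == 0])
```

With exact fractions the Binomial singleton probabilities sum to exactly 1, and the geometric partial sum equals $1 - (1 - p)^{50}$, which approaches 1 as more terms are added. The Poisson sum is computed in floating point and is only approximately 1.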


6.2 Exercises 


1. An urn contains $a$ white balls and $b$ black balls. Balls are drawn randomly from the urn without replacement. Find the probability that a white ball is drawn at the $k$th draw.


2. An urn contains white and black balls. When two balls are drawn without replacement, suppose the probability that both the balls are white is $\frac{1}{2}$.

(a) Find the smallest number of balls in the urn. 

(b) How small can the total number of balls be if the number of black balls is even? 

3. Consider the sample space $\Omega = \mathbb{N}$. Find the values of the constant $C$ for which the following are probability measures:


(a) f(x) = 0277 
b) f(z) = Z= 


4. Recall the Poisson measure on $(\Omega, 2^{\Omega})$, where $\Omega = \mathbb{N} \cup \{0\}$. What is the probability assigned to the set of odd numbers? Prime numbers?

Lecture 7: Borel Sets and Lebesgue Measure 
Lecturer: Dr. Krishna Jagannathan Scribes: Ravi Kolla, Aseem Sharma, Vishakh Hegde 


In this lecture, we discuss the case where the sample space is uncountable. This case is more involved than the case of a countable sample space, mainly because it is often not possible to assign probabilities to all subsets of $\Omega$. Instead, we are forced to work with a smaller σ-algebra. We consider assigning a “uniform probability measure” on the unit interval.


7.1 Uncountable sample spaces 


Consider the experiment of picking a real number at random from $\Omega = [0, 1]$, such that every number is “equally likely” to be picked. It is quite apparent that a simple strategy of assigning probabilities to singleton subsets of the sample space gets into difficulties quite quickly. Indeed,


(i) If we assign some positive probability to each elementary outcome, then the probability of an event with infinitely many elements, such as $A = \{1, \frac{1}{2}, \frac{1}{3}, \dots\}$, would become unbounded.

(ii) If we assign zero probability to each elementary outcome, this alone would not be sufficient to determine the probability of an uncountable subset of $\Omega$, such as $[0, \frac{1}{2}]$. This is because probability measures are not additive over uncountable disjoint unions (of singletons in this case).


Thus, we need a different approach to assign probabilities when the sample space is uncountable, such as $\Omega = [0, 1]$. In particular, we need to assign probabilities directly to specific subsets of $\Omega$. Intuitively, we would like our ‘uniform measure’ $\mu$ on $[0, 1]$ to possess the following two properties.


(i) $\mu((a, b)) = \mu((a, b]) = \mu([a, b)) = \mu([a, b])$


(ii) Translational invariance. That is, if $A \subseteq [0, 1]$, then for any $x \in \Omega$, $\mu(A \oplus x) = \mu(A)$, where the set $A \oplus x$ is defined as

$$A \oplus x = \{a + x \mid a \in A,\ a + x \le 1\} \cup \{a + x - 1 \mid a \in A,\ a + x > 1\}.$$


However, the following impossibility result asserts that there is no way to consistently define a uniform measure on all subsets of $[0, 1]$.


Theorem 7.1 (Impossibility Result) There does not exist a definition of a measure $\mu(A)$ for all subsets of $[0, 1]$ satisfying (i) and (ii).


Proof: Refer to Proposition 1.2.6 in [1].


Therefore, we must compromise, and consider a smaller σ-algebra that contains certain “nice” subsets of the sample space $[0, 1]$. These “nice” subsets are the intervals, and the resulting σ-algebra is called the Borel σ-algebra. Before defining Borel sets, we introduce the concept of generating σ-algebras from a given collection of subsets.

7.2 Generated σ-algebra and Borel sets


The σ-algebra generated by a collection of subsets of the sample space is the smallest σ-algebra that contains the collection. More formally, we have the following theorem.


Theorem 7.2 Let $\mathcal{C}$ be an arbitrary collection of subsets of $\Omega$. Then there exists a smallest σ-algebra, denoted by $\sigma(\mathcal{C})$, that contains all elements of $\mathcal{C}$. That is, if $\mathcal{H}$ is any σ-algebra such that $\mathcal{C} \subseteq \mathcal{H}$, then $\sigma(\mathcal{C}) \subseteq \mathcal{H}$. $\sigma(\mathcal{C})$ is called the σ-algebra generated by $\mathcal{C}$.


Proof: Let $\{\mathcal{F}_i, i \in \mathcal{I}\}$ denote the collection of all σ-algebras that contain $\mathcal{C}$. Clearly, the collection $\{\mathcal{F}_i, i \in \mathcal{I}\}$ is non-empty, since it contains at least the power set, $2^{\Omega}$. Consider the intersection $\bigcap_{i \in \mathcal{I}} \mathcal{F}_i$. Since the intersection of σ-algebras results in a σ-algebra (homework problem!) and the intersection contains $\mathcal{C}$, it follows that $\bigcap_{i \in \mathcal{I}} \mathcal{F}_i$ is a σ-algebra that contains $\mathcal{C}$. Finally, if $\mathcal{H}$ is a σ-algebra with $\mathcal{C} \subseteq \mathcal{H}$, then $\mathcal{H}$ is one of the $\mathcal{F}_i$'s for some $i \in \mathcal{I}$, so $\bigcap_{i \in \mathcal{I}} \mathcal{F}_i \subseteq \mathcal{H}$. Hence $\bigcap_{i \in \mathcal{I}} \mathcal{F}_i$ is the smallest σ-algebra containing $\mathcal{C}$, i.e., it is $\sigma(\mathcal{C})$. ∎

Intuitively, we can think of $\mathcal{C}$ as being the collection of subsets of $\Omega$ which are of interest to us. Then, $\sigma(\mathcal{C})$ is the smallest σ-algebra containing all the ‘interesting’ subsets.
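When $\Omega$ is finite, the generated σ-algebra can be computed directly. The sketch below does not follow the intersection argument of the proof; instead it repeatedly closes the collection under complements and pairwise unions until nothing new appears, which on a finite sample space yields the same smallest σ-algebra. The function name is hypothetical.

```python
def generate_sigma_algebra(omega, collection):
    """Smallest sigma-algebra on a finite omega containing `collection`.

    Computed by closing under complement and pairwise union until the
    family stops growing (valid because omega is finite).
    """
    omega = frozenset(omega)
    sigma = {frozenset(), omega} | {frozenset(s) for s in collection}
    changed = True
    while changed:
        changed = False
        current = list(sigma)
        for a in current:
            complement = omega - a
            if complement not in sigma:
                sigma.add(complement)
                changed = True
            for b in current:
                union = a | b
                if union not in sigma:
                    sigma.add(union)
                    changed = True
    return sigma


omega = {1, 2, 3, 4}
small = generate_sigma_algebra(omega, [{1}])
larger = generate_sigma_algebra(omega, [{1}, {1, 2}])
```

Here `small` is the four-set family consisting of the empty set, {1}, {2, 3, 4} and the whole space, while `larger` has eight sets, namely all unions of the blocks {1}, {2} and {3, 4}.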


We are now ready to define Borel sets. 


Definition 7.3


(a) Consider $\Omega = (0, 1]$. Let $\mathcal{C}_0$ be the collection of all open intervals in $(0, 1]$. Then $\sigma(\mathcal{C}_0)$, the σ-algebra generated by $\mathcal{C}_0$, is called the Borel σ-algebra. It is denoted by $\mathcal{B}((0, 1])$.


(b) An element of $\mathcal{B}((0, 1])$ is called a Borel-measurable set, or simply a Borel set.


Thus, every open interval in $(0, 1]$ is a Borel set. We next prove that every singleton set in $(0, 1]$ is a Borel set.


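Lemma 7.4 Every singleton set $\{b\}$, $0 < b < 1$, is a Borel set, i.e., $\{b\} \in \mathcal{B}((0, 1])$.

Proof: Consider the collection of sets $\left\{\left(b - \frac{1}{n}, b + \frac{1}{n}\right), n \ge 1\right\}$. By the definition of Borel sets,

$$\left(b - \frac{1}{n}, b + \frac{1}{n}\right) \in \mathcal{B}((0, 1]).$$

Using the properties of a σ-algebra,

$$\begin{aligned}
&\left(b - \frac{1}{n}, b + \frac{1}{n}\right)^c \in \mathcal{B}((0, 1]) \\
\implies\ & \bigcup_{n=1}^{\infty} \left(b - \frac{1}{n}, b + \frac{1}{n}\right)^c \in \mathcal{B}((0, 1]) \\
\implies\ & \left(\bigcap_{n=1}^{\infty} \left(b - \frac{1}{n}, b + \frac{1}{n}\right)\right)^c \in \mathcal{B}((0, 1]) \\
\implies\ & \bigcap_{n=1}^{\infty} \left(b - \frac{1}{n}, b + \frac{1}{n}\right) \in \mathcal{B}((0, 1]). \qquad (7.1)
\end{aligned}$$

Next, we claim that

$$\{b\} = \bigcap_{n=1}^{\infty} \left(b - \frac{1}{n}, b + \frac{1}{n}\right), \qquad (7.2)$$

i.e., $b$ is the only element in $\bigcap_{n=1}^{\infty} \left(b - \frac{1}{n}, b + \frac{1}{n}\right)$. We prove this by contradiction. Let $h$ be an element in $\bigcap_{n=1}^{\infty} \left(b - \frac{1}{n}, b + \frac{1}{n}\right)$ other than $b$. For every such $h$, there exists a large enough $n_0$ such that $h \notin \left(b - \frac{1}{n_0}, b + \frac{1}{n_0}\right)$. This implies $h \notin \bigcap_{n=1}^{\infty} \left(b - \frac{1}{n}, b + \frac{1}{n}\right)$, a contradiction. Using (7.1) and (7.2) thus proves that $\{b\} \in \mathcal{B}((0, 1])$. ∎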


As an immediate consequence of this lemma, we see that every half-open interval $(a, b]$ is a Borel set. This follows from the fact that

$$(a, b] = (a, b) \cup \{b\},$$

and the fact that a countable union of Borel sets is a Borel set. For the same reason, every closed interval $[a, b] = \{a\} \cup (a, b) \cup \{b\}$ is a Borel set.


Note: Arbitrary union of open sets is always an open set, but infinite intersections of open sets need not be 
open. 


Further reading for the enthusiastic (try Wikipedia for a start):


• Non-Borel sets

• Non-measurable sets (Vitali set)

• Banach-Tarski paradox (a bizarre phenomenon about cutting up the surface of a sphere). See https://www.youtube.com/watch?v=Tk4ubu7B1Sk

• The cardinality of the Borel σ-algebra (on the unit interval) is the same as the cardinality of the reals. Thus, the Borel σ-algebra is a much ‘smaller’ collection than the power set $2^{[0,1]}$. See https://math.dartmouth.edu/archive/m103f08/public_html/borel-sets-soln.pdf


7.3 Caratheodory’s Extension Theorem 


In this section, we discuss a formal procedure to define a probability measure on a general measurable space $(\Omega, \mathcal{F})$. Specifying the probability measure for all the elements of $\mathcal{F}$ directly is difficult, so we start with a smaller collection $\mathcal{F}_0$ of ‘interesting’ subsets of $\Omega$, which need not be a σ-algebra. We should take $\mathcal{F}_0$ to be rich enough, so that the σ-algebra it generates is the same as $\mathcal{F}$. Then we define a function $P_0 : \mathcal{F}_0 \to [0, 1]$, such that it corresponds to the probabilities we would like to assign to the interesting subsets in $\mathcal{F}_0$. Under certain conditions, this function $P_0$ can be extended to a legitimate probability measure on $(\Omega, \mathcal{F})$ by using the following fundamental theorem from measure theory.


Theorem 7.5 (Caratheodory’s extension theorem) Let $\mathcal{F}_0$ be an algebra of subsets of $\Omega$, and let $\mathcal{F} = \sigma(\mathcal{F}_0)$ be the σ-algebra that it generates. Suppose that $P_0$ is a mapping from $\mathcal{F}_0$ to $[0, 1]$ that satisfies $P_0(\Omega) = 1$, as well as countable additivity on $\mathcal{F}_0$.

Then, $P_0$ can be extended uniquely to a probability measure on $(\Omega, \mathcal{F})$. That is, there exists a unique probability measure $P$ on $(\Omega, \mathcal{F})$ such that $P(A) = P_0(A)$ for all $A \in \mathcal{F}_0$.

Proof: Refer to Appendix A of [2]. ∎


We use this theorem to define a uniform measure on $(0, 1]$, which is also called the Lebesgue measure.


7.4 The Lebesgue measure 


Consider $\Omega = (0, 1]$. Let $\mathcal{F}_0$ consist of the empty set and all sets that are finite unions of intervals of the form $(a, b]$. A typical element of this set is of the form

$$F = (a_1, b_1] \cup (a_2, b_2] \cup \cdots \cup (a_n, b_n],$$

where $0 \le a_1 < b_1 \le a_2 < b_2 \le \cdots \le a_n < b_n \le 1$ and $n \in \mathbb{N}$.

Lemma 7.6


a) $\mathcal{F}_0$ is an algebra

b) $\mathcal{F}_0$ is not a σ-algebra

c) $\sigma(\mathcal{F}_0) = \mathcal{B}$

Proof:


a) By definition, $\emptyset \in \mathcal{F}_0$. Also, $\emptyset^c = (0, 1] \in \mathcal{F}_0$. The complement of $(a_1, b_1] \cup (a_2, b_2]$ is $(0, a_1] \cup (b_1, a_2] \cup (b_2, 1]$, which also belongs to $\mathcal{F}_0$. Furthermore, the union of finitely many sets, each of which is a finite union of intervals of the form $(a, b]$, is also a finite union of such intervals, and thus belongs to $\mathcal{F}_0$.


b) To see this, note that $\left(0, \frac{n}{n+1}\right] \in \mathcal{F}_0$ for every $n$, but $\bigcup_{n=1}^{\infty} \left(0, \frac{n}{n+1}\right] = (0, 1) \notin \mathcal{F}_0$.


c) First, the null set is clearly a Borel set. Next, we have already seen that every interval of the form $(a, b]$ is a Borel set. Hence, every element of $\mathcal{F}_0$ (other than the null set), which is a finite union of such intervals, is also a Borel set. Therefore, $\mathcal{F}_0 \subseteq \mathcal{B}$. This implies $\sigma(\mathcal{F}_0) \subseteq \mathcal{B}$.


Next we show that $\mathcal{B} \subseteq \sigma(\mathcal{F}_0)$. For any interval of the form $(a, b)$ in $\mathcal{C}_0$, we can write $(a, b) = \bigcup_{n=1}^{\infty} \left(\left(a, b - \frac{1}{n}\right] \cap \Omega\right)$. Since every interval of the form $\left(a, b - \frac{1}{n}\right]$ belongs to $\mathcal{F}_0$, a countable union of such intervals belongs to $\sigma(\mathcal{F}_0)$. Therefore, $(a, b) \in \sigma(\mathcal{F}_0)$ and consequently, $\mathcal{C}_0 \subseteq \sigma(\mathcal{F}_0)$. This gives $\sigma(\mathcal{C}_0) \subseteq \sigma(\mathcal{F}_0)$. Using the fact that $\sigma(\mathcal{C}_0) = \mathcal{B}$ proves the required result.


For every $F \in \mathcal{F}_0$ of the form

$$F = (a_1, b_1] \cup (a_2, b_2] \cup \cdots \cup (a_n, b_n],$$

we define a function $P_0 : \mathcal{F}_0 \to [0, 1]$ such that

$$P_0(\emptyset) = 0 \quad \text{and} \quad P_0(F) = \sum_{i=1}^{n} (b_i - a_i).$$

Note that $P_0(\Omega) = P_0((0, 1]) = 1$. Also, if $(a_1, b_1], (a_2, b_2], \dots, (a_n, b_n]$ are disjoint sets, then

$$P_0\left(\bigcup_{i=1}^{n} (a_i, b_i]\right) = \sum_{i=1}^{n} P_0((a_i, b_i]) = \sum_{i=1}^{n} (b_i - a_i),$$

implying finite additivity of $P_0$. It turns out that $P_0$ is countably additive on $\mathcal{F}_0$ as well, i.e., if $(a_1, b_1], (a_2, b_2], \dots$ are disjoint sets such that $\bigcup_{i=1}^{\infty} (a_i, b_i] \in \mathcal{F}_0$, then

$$P_0\left(\bigcup_{i=1}^{\infty} (a_i, b_i]\right) = \sum_{i=1}^{\infty} P_0((a_i, b_i]) = \sum_{i=1}^{\infty} (b_i - a_i).$$

The proof is non-trivial and beyond the scope of this course (see [Williams] for a proof). Thus, in view of Theorem 7.5, there exists a unique probability measure $P$ on $((0, 1], \mathcal{B})$ which is the same as $P_0$ on $\mathcal{F}_0$. This unique probability measure on $(0, 1]$ is called the Lebesgue or uniform measure.
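The function $P_0$ on the algebra $\mathcal{F}_0$ is simple enough to compute exactly. The sketch below represents an element of $\mathcal{F}_0$ as a list of (a, b) pairs standing for disjoint intervals $(a, b]$; this representation and the function names `p0` and `complement` are assumptions made for illustration.

```python
from fractions import Fraction


def p0(intervals):
    """P_0 of a finite union of disjoint half-open intervals (a, b] inside (0, 1].

    An empty list stands for the empty set, whose value is 0.
    """
    total = Fraction(0)
    previous_b = Fraction(0)
    for a, b in sorted(intervals):
        if not (previous_b <= a < b <= 1):
            raise ValueError("intervals must be non-empty, disjoint and inside (0, 1]")
        total += b - a
        previous_b = b
    return total


def complement(intervals):
    """Complement within (0, 1], again as a list of half-open intervals."""
    gaps = []
    previous_b = Fraction(0)
    for a, b in sorted(intervals):
        if previous_b < a:
            gaps.append((previous_b, a))
        previous_b = b
    if previous_b < 1:
        gaps.append((previous_b, Fraction(1)))
    return gaps


F = [(Fraction(1, 8), Fraction(1, 4)), (Fraction(1, 2), Fraction(3, 4))]
F_c = complement(F)
```

For this example `p0(F)` is 3/8 and `p0(F_c)` is 5/8, so the two values add up to $P_0((0, 1]) = 1$, as finite additivity requires.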


The Lebesgue measure formalizes the notion of length. This suggests that the Lebesgue measure of a singleton should be zero. This can be shown as follows. Let $b \in (0, 1]$. Arguing as in (7.2), we write

$$P(\{b\}) = P\left(\bigcap_{n=1}^{\infty} \left(b - \frac{1}{n}, b\right]\right).$$

Let $A_n = \left(b - \frac{1}{n}, b\right]$. For each $n$ (large enough that $b - \frac{1}{n} \ge 0$), the Lebesgue measure of $A_n$ is

$$P(A_n) = \frac{1}{n}. \qquad (7.3)$$

Since $A_n$ is a decreasing sequence of nested sets,

$$P(\{b\}) = P\left(\bigcap_{n=1}^{\infty} A_n\right) = \lim_{n \to \infty} P(A_n) = \lim_{n \to \infty} \frac{1}{n} = 0,$$

where the second equality follows from the continuity of probability measures.


Since any countable set is a countable union of singletons, the probability of a countable set is zero. For example, under the uniform measure on $(0, 1]$, the probability of the set of rationals is zero, since the rational numbers in $(0, 1]$ form a countable set.


For $\Omega = (0, 1]$, the Lebesgue measure is also a probability measure. For other intervals (for example $\Omega = (0, 2]$), it will only be a finite measure, which can be normalized as appropriate to obtain a uniform probability measure.


Definition 7.7 Let $(\Omega, \mathcal{F}, P)$ be a probability space. An event $A$ is said to occur almost surely (a.s.) if $P(A) = 1$.


Caution: $P(A) = 1$ does not mean $A = \Omega$.

Lebesgue Measure of the Cantor set: Consider the Cantor set $K$. It is created by repeatedly removing the open middle thirds of a set of line segments. Consider its complement. It consists of a countable number of disjoint intervals. Hence we have:

$$P(K^c) = \frac{1}{3} + \frac{2}{9} + \frac{4}{27} + \cdots = \frac{1/3}{1 - 2/3} = 1.$$

Therefore $P(K) = 0$. It is very interesting to note that although the Cantor set is equicardinal with $(0, 1]$, its Lebesgue measure is equal to 0 while the Lebesgue measure of $(0, 1]$ is equal to 1.
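The geometric series above can be checked with exact arithmetic. In the sketch below, stage $k$ of the construction removes $2^{k-1}$ open intervals, each of length $3^{-k}$; the function name is hypothetical.

```python
from fractions import Fraction


def removed_length(stages):
    """Total length of the open middle thirds removed in the first `stages` steps."""
    return sum(Fraction(2 ** (k - 1), 3**k) for k in range(1, stages + 1))


partial_sums = [removed_length(n) for n in (1, 2, 3, 10)]
remaining_after_10 = 1 - removed_length(10)
```

The partial sums equal $1 - (2/3)^n$, so the length remaining after $n$ stages is $(2/3)^n$, which tends to 0 as $n$ grows.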

We now extend the definition of Lebesgue measure on $(0, 1]$ to the real line, $\mathbb{R}$. We first look at the definition of a Borel set on $\mathbb{R}$. This can be done in several ways, as shown below.


Definition 7.8 Borel sets on $\mathbb{R}$:


• Let $\mathcal{C}$ be the collection of all open intervals in $\mathbb{R}$. Then $\mathcal{B}(\mathbb{R}) = \sigma(\mathcal{C})$ is the Borel σ-algebra on $\mathbb{R}$.

• Let $\mathcal{D}$ be the collection of semi-infinite intervals $\{(-\infty, x];\ x \in \mathbb{R}\}$. Then $\sigma(\mathcal{D}) = \mathcal{B}(\mathbb{R})$.

• $A \subseteq \mathbb{R}$ is said to be a Borel set on $\mathbb{R}$ if $A \cap (n, n+1]$ is a Borel set on $(n, n+1]$ $\forall n \in \mathbb{Z}$.

Exercise: Verify that the three statements are equivalent definitions of Borel sets on $\mathbb{R}$.


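Definition 7.9 Lebesgue measure of $A \subseteq \mathbb{R}$:

$$\lambda(A) = \sum_{n=-\infty}^{\infty} P_n(A \cap (n, n+1]),$$

where $P_n$ denotes the Lebesgue (uniform) measure on $(n, n+1]$.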


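Theorem 7.10 $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \lambda)$ is an infinite measure space.


Proof: We need to prove the following:

• $\lambda(\mathbb{R}) = \infty$

• $\lambda(\emptyset) = 0$

• The countable additivity property

We see that
$$P_n(\mathbb{R} \cap (n, n+1]) = 1, \quad \forall n \in \mathbb{Z}.$$

Hence we have
$$\lambda(\mathbb{R}) = \sum_{n=-\infty}^{\infty} 1 = \infty.$$

Now consider $\emptyset \cap (n, n+1]$. This is a null set for all $n$. Hence we have
$$P_n(\emptyset \cap (n, n+1]) = 0, \quad \forall n \in \mathbb{Z},$$

which implies
$$\lambda(\emptyset) = \sum_{n=-\infty}^{\infty} P_n(\emptyset \cap (n, n+1]) = 0.$$

We now need to prove the countable additivity property. For this, we consider a sequence $A_1, A_2, \dots, A_n, \dots$ of arbitrary pairwise disjoint sets in $\mathcal{B}(\mathbb{R})$. We obtain

$$\begin{aligned}
\lambda\left(\bigcup_{i=1}^{\infty} A_i\right) &= \sum_{n=-\infty}^{\infty} P_n\left(\left(\bigcup_{i=1}^{\infty} A_i\right) \cap (n, n+1]\right) \\
&= \sum_{n=-\infty}^{\infty} \sum_{i=1}^{\infty} P_n(A_i \cap (n, n+1]) \\
&= \sum_{i=1}^{\infty} \sum_{n=-\infty}^{\infty} P_n(A_i \cap (n, n+1]).
\end{aligned}$$

The second equality above comes from the fact that the probability measure $P_n$ has the countable additivity property. The last equality comes from the fact that the summations can be interchanged since all terms are non-negative (from Fubini’s theorem). We also have the following:

$$\lambda(A_i) = \sum_{n=-\infty}^{\infty} P_n(A_i \cap (n, n+1]).$$

We now immediately see that
$$\lambda\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \lambda(A_i).$$

Hence proved. ∎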


7.5 Exercises 


1. Let $\mathcal{F}$ be a σ-algebra corresponding to a sample space $\Omega$. Let $H$ be a subset of $\Omega$ that does not belong to $\mathcal{F}$. Consider the collection $\mathcal{G}$ of all sets of the form $(H \cap A) \cup (H^c \cap B)$, where $A, B \in \mathcal{F}$.

(a) Show that $H \cap A \in \mathcal{G}$.

(b) Show that $\mathcal{G}$ is a σ-algebra.


2. Show that $\mathcal{C} = \sigma(\mathcal{C})$ iff $\mathcal{C}$ is a σ-algebra.

3. Let $\mathcal{C}$ and $\mathcal{D}$ be two collections of subsets of $\Omega$ such that $\mathcal{C} \subseteq \mathcal{D}$. Prove that $\sigma(\mathcal{C}) \subseteq \sigma(\mathcal{D})$.

4. Prove that the following subsets of $(0, 1]$ are Borel-measurable.

(a) any countable set

(b) the set of irrational numbers

(c) the Cantor set (Hint: rather than defining it in terms of ternary expansions, it’s easier to use the equivalent definition of the Cantor set that involves sequentially removing the “middle-third” open intervals; see Wikipedia for example).

(d) the set of numbers in $(0, 1]$ whose decimal expansion does not contain 7.

5. Let $\mathcal{B}$ denote the Borel σ-algebra as defined in class. Let $\mathcal{C}_c$ denote the set of all closed intervals contained in $(0, 1]$. Show that $\sigma(\mathcal{C}_c) = \mathcal{B}$. In other words, we could have very well defined the Borel σ-algebra as being generated by closed intervals, rather than open intervals.

6. Let $\Omega = [0, 1]$, and let $\mathcal{F}_3$ consist of all countable subsets of $\Omega$, and all subsets of $\Omega$ having a countable complement. It can be shown that $\mathcal{F}_3$ is a σ-algebra (refer to Lecture 4, Exercises, 6(d)). Let us define $P(A) = 0$ if $A$ is countable, and $P(A) = 1$ if $A$ has a countable complement. Is $(\Omega, \mathcal{F}_3, P)$ a legitimate probability space?


7. We have seen in 4(c) that the Cantor set is Borel-measurable. Show that the Cantor set has zero Lebesgue measure. Thus, although the Cantor set can be put into a bijection with $[0, 1]$, it has zero Lebesgue measure!

Lecture 8: The Infinite Coin Toss Model


Lecturer: Dr. Krishna Jagannathan Scribe: Subrahmanya Swamy P 


In this lecture, we will discuss the random experiment where each trial consists of tossing a coin infinitely
many times. We will describe the sample space, an appropriate $\sigma$-algebra, and a probability measure that intuitively
corresponds to fair coin tosses. If we denote Heads/Tails with 0/1, the sample space of this experiment turns
out to be $\Omega = \{0,1\}^{\infty}$, and each elementary outcome is some infinite binary string. As we have seen before,
this is an uncountable sample space, so defining a useful $\sigma$-algebra on $\Omega$ takes some effort.


8.1 A $\sigma$-algebra on $\Omega = \{0,1\}^{\infty}$


Let $\mathcal{F}_n$ be the collection of subsets of $\Omega$ whose occurrence can be decided by looking at the result of the
first $n$ tosses. More formally, the elements of $\mathcal{F}_n$ can be described as follows: $A \in \mathcal{F}_n$ if and only if there
exists some $A^{(n)} \subseteq \{0,1\}^n$ such that $A = \{\omega \in \Omega \mid (\omega_1, \omega_2, \ldots, \omega_n) \in A^{(n)}\}$.


Examples:


1. Let $A_1$ be the set of all elements of $\Omega$ such that there are exactly 2 heads during the first 4 coin tosses.
Clearly, $A_1 \in \mathcal{F}_4$.


2. Let $A_2$ be the set of all elements of $\Omega$ such that the third toss is a Head. Then, $A_2 \in \mathcal{F}_3$.


Also note that the following relation holds:


\[
\mathcal{F}_n \subseteq \mathcal{F}_{n+1} \quad \forall n \in \mathbb{N}. \tag{8.1}
\]


Although $\mathcal{F}_n$ is a $\sigma$-algebra, it has the drawback that it allows us to describe only those subsets which can
be decided in $n$ tosses. For example, the singleton set containing the all-Heads sequence is not an element of
$\mathcal{F}_n$ for any $n$.

In order to overcome this drawback, we define $\mathcal{F}_0 = \bigcup_{i \in \mathbb{N}} \mathcal{F}_i$. In words, $\mathcal{F}_0$ is the collection of all subsets of $\Omega$
that can be decided in finitely many coin tosses, since an element of $\mathcal{F}_0$ must be an element of $\mathcal{F}_i$ for some
$i \in \mathbb{N}$.


Proposition 8.1 We claim the following:


(i) $\mathcal{F}_0$ is an algebra.


(ii) $\mathcal{F}_0$ is not a $\sigma$-algebra.


Proof:


(i) This is just definition chasing! (Left as an exercise).


(ii) Consider the following example: Let $E = \{\omega \in \Omega \mid \text{every odd toss results in Heads}\}$. Clearly, $E \notin \mathcal{F}_0$
since we cannot decide the occurrence of $E$ in finitely many tosses. On the other hand, $E$ can be
expressed as a countable intersection of elements in $\mathcal{F}_0$:
\[
E = \bigcap_{i=1}^{\infty} A_{2i-1},
\]
where $A_i \in \mathcal{F}_0$ is the set of all binary strings with Heads in the $i$th toss. Since a $\sigma$-algebra is closed
under countable intersections, $\mathcal{F}_0$ cannot be a $\sigma$-algebra.


Next, consider the smallest $\sigma$-algebra containing all the elements of $\mathcal{F}_0$, i.e., define
\[
\mathcal{F} = \sigma(\mathcal{F}_0).
\]


8.2 A Probability Measure on $(\Omega = \{0,1\}^{\infty}, \mathcal{F})$


Now, we shall define a uniform probability measure on $\mathcal{F}$ that corresponds to a ‘fair’ coin toss model. We shall
first define a finitely additive function $\mathbb{P}_0$ on $\mathcal{F}_0$ that also satisfies $\mathbb{P}_0(\Omega) = 1$. Then, we shall
extend $\mathbb{P}_0$ to a probability measure $\mathbb{P}$ on $\mathcal{F}$.


If $A \in \mathcal{F}_0$, then by the definition of $\mathcal{F}_0$, there exists $n$ such that $A \in \mathcal{F}_n$. By the definition of $\mathcal{F}_n$, we know that for
every $A \in \mathcal{F}_n$, there exists a corresponding $A^{(n)} \subseteq \{0,1\}^n$. We will use this $A^{(n)}$ in the definition of $\mathbb{P}_0$. We
define $\mathbb{P}_0 : \mathcal{F}_0 \to [0,1]$ as follows:
\[
\mathbb{P}_0(A) = \frac{|A^{(n)}|}{2^n}.
\]


Having defined $\mathbb{P}_0$ this way, we need to verify that this definition is consistent. In particular, we note that
if $A \in \mathcal{F}_n$, then $A \in \mathcal{F}_{n+1}$, which is trivially true because the $\mathcal{F}_n$'s are nested increasing. We therefore need to prove
that when we apply the definition of $\mathbb{P}_0(A)$ for different choices of $n$, we obtain the same value. We leave it to
the reader to supply a formal proof for the consistency of $\mathbb{P}_0$. However, we illustrate this consistency using
the examples provided in Section 8.1.


(i) The occurrence of the event $A_2$ can be decided in the first 3 tosses. So, $A_2 \in \mathcal{F}_3$. The elements in
$A^{(3)} \subseteq \{0,1\}^3$ corresponding to the event $A_2$ are $\{(0,0,0), (0,1,0), (1,0,0), (1,1,0)\}$. So $|A^{(3)}| = 4$.
So, $\mathbb{P}_0(A_2) = \frac{4}{2^3} = \frac{1}{2}$.

The event $A_2$ can also be looked at as an event in $\mathcal{F}_4$, since $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$. The elements in the corresponding
$A^{(4)}$ will be $\{(0,0,0,0), (0,1,0,0), (1,0,0,0), (1,1,0,0), (0,0,0,1), (0,1,0,1), (1,0,0,1), (1,1,0,1)\}$.
So $|A^{(4)}| = 8$. So, $\mathbb{P}_0(A_2) = \frac{8}{2^4} = \frac{1}{2}$.


(ii) $A_1$ can be decided by looking at the outcome of the first four tosses. So, $A_1 \in \mathcal{F}_4$. It is easy to see that
the number of elements in $A^{(4)} \subseteq \{0,1\}^4$ corresponding to the event $A_1$ that has exactly two heads is
$\binom{4}{2}$. Hence, $\mathbb{P}_0(A_1) = \binom{4}{2} / 2^4$. Next, can you compute $\mathbb{P}_0(A_1)$ by considering $A_1$ as an element of, say, $\mathcal{F}_5$?


From the above examples, we can observe that
a) The definition of $\mathbb{P}_0$ is consistent over different choices of $n$, namely $n = 3$ and $n = 4$, for a given set
$A_2$.


b) The definition of $\mathbb{P}_0$ is also consistent with the intuition of a fair coin toss model with the probability of
heads being $\frac{1}{2}$.
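
The consistency of the definition can also be checked by brute-force enumeration for small $n$. The following is a minimal Python sketch, not part of the original notes: the helper name `p0` and the two predicates are hypothetical, and an event is represented by a predicate on the first $n$ tosses (0 = Heads, 1 = Tails), so that $A^{(n)}$ is the set of $n$-tuples satisfying it.

```python
from fractions import Fraction
from itertools import product


def p0(event, n):
    """|A^(n)| / 2^n for an event decided by the first n tosses.

    `event` is a predicate on an n-tuple of tosses (0 = Heads, 1 = Tails);
    A^(n) is the set of n-tuples for which the predicate holds.
    """
    a_n = [w for w in product((0, 1), repeat=n) if event(w)]
    return Fraction(len(a_n), 2 ** n)


def third_toss_is_head(w):
    # The event A_2 of Section 8.1 (needs n >= 3).
    return w[2] == 0


def two_heads_in_first_four(w):
    # The event A_1 of Section 8.1 (needs n >= 4).
    return sum(1 for t in w[:4] if t == 0) == 2


# The same event viewed as a member of F_3, F_4, ... (resp. F_4, F_5, ...).
a2_values = [p0(third_toss_is_head, n) for n in range(3, 8)]
a1_values = [p0(two_heads_in_first_four, n) for n in range(4, 8)]
print(a2_values)
print(a1_values)
```

Every entry of `a2_values` is the same number, and likewise for `a1_values`, which is what consistency of the definition requires. Enumeration only checks finitely many $n$; it does not replace the formal proof left to the reader.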


It can be easily verified that $\mathbb{P}_0(\Omega) = 1$ and $\mathbb{P}_0$ is finitely additive. It also turns out that $\mathbb{P}_0$ is countably
additive on $\mathcal{F}_0$ (the proof of this fact is non-trivial and is omitted here). This allows us to invoke the
Carathéodory extension theorem and extend $\mathbb{P}_0$ to $\mathbb{P}$, a legitimate probability measure on $(\Omega, \mathcal{F})$ which
agrees with $\mathbb{P}_0$ on $\mathcal{F}_0$. In other words, there exists a unique probability measure $\mathbb{P}$ on $(\Omega, \mathcal{F})$ that agrees
with $\mathbb{P}_0$ on $\mathcal{F}_0$.


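As an example, let us consider the event $E$ that is defined above (i.e., the set of strings in which all the odd
tosses are heads). As $E \notin \mathcal{F}_0$, $\mathbb{P}_0$ is not defined for the event $E$. However, it is clear that $E \in \mathcal{F}$, so that $\mathbb{P}$
is defined for $E$. Let us calculate the probability of the event $E$. Recall that
\[
E = \bigcap_{i=1}^{\infty} A_{2i-1}, \quad \text{where } A_i = \{\omega \in \Omega \mid \omega_i = 0\}.
\]


Let us define the event $E_m = \bigcap_{i=1}^{m} A_{2i-1}$. In other words, $E_m$ is the set of outcomes in which the first $2m$ tosses
have the property of all odd tosses being heads. We can easily verify that $\mathbb{P}(E_m) = \mathbb{P}_0(E_m) = \frac{1}{2^m}$. Note
that $\{E_m, m \geq 1\}$ is a sequence of nested decreasing events, i.e., $E_m \supseteq E_{m+1}$, $\forall m \geq 1$. It can be easily
verified that $E$ can be expressed in terms of these decreasing nested events as $E = \bigcap_{m=1}^{\infty} E_m$.


Thus,
\[
\mathbb{P}(E) = \mathbb{P}\Big(\bigcap_{m=1}^{\infty} E_m\Big) \overset{(a)}{=} \lim_{m \to \infty} \mathbb{P}(E_m) = \lim_{m \to \infty} \frac{1}{2^m} = 0,
\]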


where the equality (a) follows from the continuity of probability measures. 
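
The values of the probabilities of the finite-stage events used in this limit can be reproduced by enumeration. The sketch below is illustrative only: the function name `prob_E_m` is hypothetical, and it counts the binary strings of length $2m$ whose odd-numbered tosses are all Heads (0).

```python
from fractions import Fraction
from itertools import product


def prob_E_m(m):
    """Probability that tosses 1, 3, ..., 2m-1 are all Heads (0).

    The event is decided by the first 2m tosses, so we enumerate
    {0,1}^(2m) and count the favourable strings.
    """
    n = 2 * m
    favourable = sum(
        1
        for w in product((0, 1), repeat=n)
        # 0-based index 2*i corresponds to toss number 2*i + 1 (odd).
        if all(w[2 * i] == 0 for i in range(m))
    )
    return Fraction(favourable, 2 ** n)


values = [prob_E_m(m) for m in range(1, 7)]
print(values)
```

The list halves at every step, which is the decreasing sequence whose limit gives the probability of the event of all odd tosses being heads.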


8.3 Exercises 


1. Show that $\mathcal{F}_n$ (defined in equation 8.1) is a $\sigma$-algebra $\forall n \in \mathbb{N}$.


2. Recall the infinite coin toss model with $\Omega = \{0,1\}^{\infty}$, where ‘0’ denotes heads and ‘1’ denotes tails.
Define $\mathcal{F}_n$ as the collection of subsets of $\Omega$ whose occurrence can be decided by looking at the results of the
first $n$ tosses.


(a) Show that $\mathcal{F}_n$ is a $\sigma$-algebra.


It turns out that the $\sigma$-algebra $\mathcal{F}_n$ for any fixed $n$ is too small; after all, it can only serve to model the
first $n$ tosses. Let us define
\[
\mathcal{F}_0 = \bigcup_{i=1}^{\infty} \mathcal{F}_i. \tag{8.2}
\]
(b) Give a verbal description of the collection $\mathcal{F}_0$.
(c) Show that $\mathcal{F}_0$ is an algebra on $\Omega$.


(d) Consider the subset $A \subseteq \Omega$ consisting of sequences in which Tails occurs infinitely many times. Does
$A \in \mathcal{F}_0$?


(e) Is $A^c$ countable?


(f) Let $B$ be the set of all infinite sequences for which $\omega_n = 0$ for every odd $n$; i.e., every odd numbered
toss is Heads. Show that $B$ can be written as a countable intersection of subsets in $\mathcal{F}_0$, but $B \notin \mathcal{F}_0$.
Therefore $\mathcal{F}_0$ is not a $\sigma$-algebra.


Define $\mathcal{F} = \sigma(\mathcal{F}_0)$, the $\sigma$-algebra generated by $\mathcal{F}_0$.


(g) Show that every singleton $\{\omega\}$ is $\mathcal{F}$-measurable. Show that the uniform measure on $(\Omega, \mathcal{F})$ defined in
class assigns zero probability measure to singletons.


(h) Let $A_i$ be the set of all outcomes such that the $i$th toss is Tails. Note that $A_i \in \mathcal{F}_0$. Show that $A$ in
part (e) can be written as
\[
A = \bigcap_{m=1}^{\infty} \bigcup_{i=m}^{\infty} A_i. \tag{8.3}
\]
Hence show that $A$ is $\mathcal{F}$-measurable. What is $\mathbb{P}(A)$ under the uniform measure?


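(i) Let $T \subseteq \Omega$ be the set of all coin toss sequences in which the fraction of Tails is exactly 1/2. More
precisely,
\[
T = \Big\{\omega \in \Omega \;\Big|\; \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \omega_i = \frac{1}{2}\Big\}. \tag{8.4}
\]
The set $T$ is called the strong-law truth set, for reasons that will become clear later. Does $T \in \mathcal{F}_0$?
(j) Show that $T$ can be expressed as
\[
T = \bigcap_{k=1}^{\infty} \bigcup_{m=1}^{\infty} \bigcap_{n=m}^{\infty} \Big\{\omega \in \Omega \;\Big|\; \Big|\frac{1}{n} \sum_{i=1}^{n} \omega_i - \frac{1}{2}\Big| \leq \frac{1}{k}\Big\}. \tag{8.5}
\]
Argue that the subset inside the nested union and intersection above belongs to $\mathcal{F}_0$. Hence show that
$T$ is $\mathcal{F}$-measurable. Hint: Don’t get intimidated by the multiple unions and intersections! Write out
the limit in the definition of $T$ as the set of all $\omega \in \Omega$ such that for all $k \geq 1$, there exists an $m$ for
which for all $n \geq m$, we have
\[
\Big|\frac{1}{n} \sum_{i=1}^{n} \omega_i - \frac{1}{2}\Big| \leq \frac{1}{k}.
\]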
Lecture 9: Conditional Probability and Independence
Lecturer: Dr. Krishna Jagannathan Scribe: Vishakh Hegde


9.1 Conditional Probability 


Definition 9.1 Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. Let $B \in \mathcal{F}$ be such that $\mathbb{P}(B) > 0$. Then the conditional
probability of $A$ given $B$ is defined as
\[
\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}.
\]


Caution: We cannot condition on sets of zero probability measure. For example, if $\Omega = [0,1]$ is endowed with
the Borel $\sigma$-algebra and a uniform probability measure, we cannot condition on the set of rationals.


Theorem 9.2 Let $B \in \mathcal{F}$ and $\mathbb{P}(B) > 0$. Then, $\mathbb{P}(\cdot \mid B) : \mathcal{F} \to [0,1]$ is a probability measure on $(\Omega, \mathcal{F})$.


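Proof: We need to show that the three properties of a probability measure hold, namely:


• $\mathbb{P}(\Omega \mid B) = 1$.
• $\mathbb{P}(\emptyset \mid B) = 0$.
• Countable additivity property.


We have
\[
\mathbb{P}(\Omega \mid B) = \frac{\mathbb{P}(\Omega \cap B)}{\mathbb{P}(B)} = \frac{\mathbb{P}(B)}{\mathbb{P}(B)} = 1,
\qquad
\mathbb{P}(\emptyset \mid B) = \frac{\mathbb{P}(\emptyset \cap B)}{\mathbb{P}(B)} = \frac{\mathbb{P}(\emptyset)}{\mathbb{P}(B)} = 0.
\]
We are now left with proving the countable additivity property. Let $A_1, A_2, \ldots$ be disjoint. We need to show
that
\[
\mathbb{P}\Big(\bigcup_{i=1}^{\infty} A_i \,\Big|\, B\Big) = \sum_{i=1}^{\infty} \mathbb{P}(A_i \mid B).
\]
Consider
\[
\mathbb{P}\Big(\bigcup_{i=1}^{\infty} A_i \,\Big|\, B\Big)
= \frac{\mathbb{P}\big(\big(\bigcup_{i=1}^{\infty} A_i\big) \cap B\big)}{\mathbb{P}(B)}
= \frac{\mathbb{P}\big(\bigcup_{i=1}^{\infty} (A_i \cap B)\big)}{\mathbb{P}(B)}.
\]


Since the $A_i$ are disjoint, the sets $A_i \cap B$ are also disjoint. Therefore we can write the following:
\[
\mathbb{P}\Big(\bigcup_{i=1}^{\infty} A_i \,\Big|\, B\Big)
= \frac{\mathbb{P}\big(\bigcup_{i=1}^{\infty} (A_i \cap B)\big)}{\mathbb{P}(B)}
= \frac{\sum_{i=1}^{\infty} \mathbb{P}(A_i \cap B)}{\mathbb{P}(B)}
= \sum_{i=1}^{\infty} \mathbb{P}(A_i \mid B).
\]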




9.1.1 Properties of Conditional Probability 


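1. The Law of Total Probability: Let $A \in \mathcal{F}$ and let $\{B_i, i = 1, 2, \ldots\}$ be events that partition $\Omega$ (by
partition we mean $\bigcup_{i \in \mathbb{N}} B_i = \Omega$ and $B_i \cap B_j = \emptyset$, $\forall i \neq j$), with $\mathbb{P}(B_i) > 0$, $\forall i$. Then,
\[
\mathbb{P}(A) = \sum_{i=1}^{\infty} \mathbb{P}(A \mid B_i)\, \mathbb{P}(B_i).
\]


Proof: We know that $\{B_i, i = 1, 2, \ldots\}$ partitions $\Omega$. Hence $\{A \cap B_i, i = 1, 2, \ldots\}$ partitions $A$.
Therefore, by the countable additivity property, we have
\[
\mathbb{P}(A) = \mathbb{P}\Big(\bigcup_{i=1}^{\infty} (A \cap B_i)\Big) = \sum_{i=1}^{\infty} \mathbb{P}(A \cap B_i),
\]
and $\mathbb{P}(A \cap B_i) = \mathbb{P}(A \mid B_i)\, \mathbb{P}(B_i)$, $\forall i$. Therefore,
\[
\sum_{i=1}^{\infty} \mathbb{P}(A \cap B_i) = \sum_{i=1}^{\infty} \mathbb{P}(A \mid B_i)\, \mathbb{P}(B_i).
\]


Note: In particular, if $B$ is such that $0 < \mathbb{P}(B) < 1$, then
\[
\mathbb{P}(A) = \mathbb{P}(A \mid B)\, \mathbb{P}(B) + \mathbb{P}(A \mid B^c)\, \mathbb{P}(B^c).
\]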

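2. Bayes’ Rule: Let $A \in \mathcal{F}$, with $\mathbb{P}(A) > 0$, and let $\{B_i, i = 1, 2, \ldots\}$ be a partition of $\Omega$ such that
$\mathbb{P}(B_i) > 0$ $\forall i$. Then, we have
\[
\mathbb{P}(B_i \mid A) = \frac{\mathbb{P}(A \mid B_i)\, \mathbb{P}(B_i)}{\sum_{j} \mathbb{P}(A \mid B_j)\, \mathbb{P}(B_j)}.
\]
Proof:
\[
\mathbb{P}(B_i \mid A) = \frac{\mathbb{P}(A \cap B_i)}{\mathbb{P}(A)} = \frac{\mathbb{P}(B_i)\, \mathbb{P}(A \mid B_i)}{\mathbb{P}(A)} = \frac{\mathbb{P}(A \mid B_i)\, \mathbb{P}(B_i)}{\sum_{j} \mathbb{P}(A \mid B_j)\, \mathbb{P}(B_j)}.
\]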


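3. For any sequence of events $\{A_i\}$, we have the following relation:
\[
\mathbb{P}\Big(\bigcap_{i=1}^{\infty} A_i\Big) = \mathbb{P}(A_1) \prod_{i=2}^{\infty} \mathbb{P}(A_i \mid A_1 \cap A_2 \cap \cdots \cap A_{i-1}),
\]
as long as all the conditional probabilities are well defined.

Proof: We know that the following holds for a finite set of events:
\[
\mathbb{P}\Big(\bigcap_{i=1}^{n} A_i\Big) = \mathbb{P}(A_1) \prod_{i=2}^{n} \mathbb{P}(A_i \mid A_1 \cap A_2 \cap \cdots \cap A_{i-1}).
\]
Now taking limits, we have:
\[
\lim_{n \to \infty} \mathbb{P}\Big(\bigcap_{i=1}^{n} A_i\Big) = \mathbb{P}(A_1) \prod_{i=2}^{\infty} \mathbb{P}(A_i \mid A_1 \cap A_2 \cap \cdots \cap A_{i-1}).
\]
Now using continuity of probability, we get the required relation,
\[
\mathbb{P}\Big(\bigcap_{i=1}^{\infty} A_i\Big) = \mathbb{P}(A_1) \prod_{i=2}^{\infty} \mathbb{P}(A_i \mid A_1 \cap A_2 \cap \cdots \cap A_{i-1}).
\]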
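
The law of total probability and Bayes' rule can be illustrated on a small finite partition. The following Python sketch is only an illustration: the three-set partition and all the numbers in `prior` and `likelihood` are hypothetical values chosen for this example, and the function names are not from the notes.

```python
from fractions import Fraction as Fr

# Hypothetical partition {B_1, B_2, B_3} with P(B_i) > 0 summing to 1,
# and hypothetical conditional probabilities P(A | B_i).
prior = [Fr(1, 2), Fr(1, 3), Fr(1, 6)]
likelihood = [Fr(1, 10), Fr(1, 5), Fr(1, 2)]


def total_probability(prior, likelihood):
    """P(A) = sum_i P(A | B_i) P(B_i)."""
    return sum(l * p for l, p in zip(likelihood, prior))


def bayes(prior, likelihood):
    """Posterior P(B_i | A) for every i, using Bayes' rule."""
    p_a = total_probability(prior, likelihood)
    return [l * p / p_a for l, p in zip(likelihood, prior)]


p_a = total_probability(prior, likelihood)
posterior = bayes(prior, likelihood)
print(p_a, posterior)
```

Because the posterior is itself a probability measure on the partition, its entries sum to one. Exact rational arithmetic is used so that this can be checked without rounding error.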
9.2 Independence


Definition 9.3 Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. Two events $A$ and $B$ are said to be independent (under
the probability measure $\mathbb{P}$) if $\mathbb{P}(A \cap B) = \mathbb{P}(A)\, \mathbb{P}(B)$.


Note: If $\mathbb{P}(B) > 0$ and $A$ and $B$ are independent, then we have
\[
\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} = \frac{\mathbb{P}(A)\, \mathbb{P}(B)}{\mathbb{P}(B)} = \mathbb{P}(A).
\]
Example: Can disjoint sets be independent at all? Let $A, B \in \mathcal{F}$ be two disjoint sets. Therefore, we have
$\mathbb{P}(A \cap B) = \mathbb{P}(\emptyset) = 0$. For independence, we need to have $\mathbb{P}(A)\, \mathbb{P}(B) = \mathbb{P}(A \cap B) = 0$.
This can happen only when $\mathbb{P}(A) = 0$ or $\mathbb{P}(B) = 0$. Therefore, two disjoint events are
independent if and only if at least one of them has zero probability.


Definition 9.4 $A_1, A_2, \ldots, A_n$ are independent if for all non-empty $I_0 \subseteq \{1, 2, \ldots, n\}$, we have
\[
\mathbb{P}\Big(\bigcap_{i \in I_0} A_i\Big) = \prod_{i \in I_0} \mathbb{P}(A_i).
\]
Next, we define independence of an arbitrary collection of events.


Definition 9.5 $\{A_i, i \in I\}$ are said to be independent if for every non-empty finite subset $I_0$ of $I$, we have
\[
\mathbb{P}\Big(\bigcap_{i \in I_0} A_i\Big) = \prod_{i \in I_0} \mathbb{P}(A_i).
\]
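
The requirement that the product rule hold for every sub-collection is strictly stronger than pairwise independence. The sketch below checks this on a standard illustrative example that is not taken from the notes: two fair coin tosses, with the (hypothetical) events `A` = first toss is heads, `B` = second toss is heads, and `C` = both tosses agree.

```python
from fractions import Fraction
from itertools import combinations, product

# Sample space of two fair coin tosses with the uniform measure.
omega = set(product("HT", repeat=2))


def prob(event):
    return Fraction(len(event), len(omega))


def independent(events):
    """Check the product rule for every non-empty sub-collection."""
    for r in range(1, len(events) + 1):
        for sub in combinations(events, r):
            inter = set(omega)
            rhs = Fraction(1)
            for e in sub:
                inter &= e       # intersection of the chosen events
                rhs *= prob(e)   # product of their probabilities
            if prob(inter) != rhs:
                return False
    return True


A = {w for w in omega if w[0] == "H"}
B = {w for w in omega if w[1] == "H"}
C = {w for w in omega if w[0] == w[1]}

pairwise = all(independent([x, y]) for x, y in combinations([A, B, C], 2))
mutual = independent([A, B, C])
print(pairwise, mutual)
```

Here every pair of events satisfies the product rule, but the triple does not, since the intersection of all three has probability 1/4 rather than 1/8.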


9.2.1 Independence of $\sigma$-algebras


Definition 9.6 Let $\mathcal{F}_1$ and $\mathcal{F}_2$ be two sub-$\sigma$-algebras of $\mathcal{F}$. We say that $\mathcal{F}_1$ and $\mathcal{F}_2$ are independent
$\sigma$-algebras if for all $A_1 \in \mathcal{F}_1$ and $A_2 \in \mathcal{F}_2$, $A_1$ and $A_2$ are independent events.


Example: A simple example we can construct is the following: Let $A, B \in \mathcal{F}$. Then $\mathcal{F}_1 = \{\emptyset, \Omega, A, A^c\}$ and
$\mathcal{F}_2 = \{\emptyset, \Omega, B, B^c\}$ are independent iff $A$ and $B$ are independent.


We now define independence of a collection of sub-$\sigma$-algebras.


Definition 9.7 Let $\{\mathcal{F}_i, i \in I\}$ (where $I$ is an index set) be a collection of sub-$\sigma$-algebras of $\mathcal{F}$. Then,
$\{\mathcal{F}_i, i \in I\}$ are said to be independent if for every choice of $A_i \in \mathcal{F}_i$, the events $\{A_i, i \in I\}$ are independent.


Example (from [Lecture 2, MIT OCW]): Consider the infinite coin toss model discussed previously.


• Let $A_i$ be the event that the $i$th coin toss resulted in heads (say). If $i \neq j$, the events $A_i$ and $A_j$ are
independent.


• The following infinite family of events is independent: $\{A_i \mid i \in \mathbb{N}\}$. This example captures the intuitive
idea of independent coin tosses.


• Let $\mathcal{F}_1$ (respectively, $\mathcal{F}_2$) be the collection of all events whose occurrence can be decided by looking at
the results of the coin tosses at odd times (respectively, at even times) $n$. More formally, let $H_i$ be the
event that the $i$th toss resulted in heads. Let $\mathcal{C} = \{H_i \mid i \text{ is odd}\}$ and let $\mathcal{F}_1 = \sigma(\mathcal{C})$, so that $\mathcal{F}_1$ is the
smallest $\sigma$-algebra that contains all the events $H_i$, for odd $i$. We define $\mathcal{F}_2$ similarly, using even times
instead of odd times. Then, the two $\sigma$-algebras $\mathcal{F}_1$ and $\mathcal{F}_2$ turn out to be independent. Intuitively,
this implies that any event whose occurrence is determined completely by the outcomes of the tosses
at odd times is independent of any event whose occurrence is determined completely by the outcomes
of the tosses at even times.


• Let $\mathcal{F}_n$ be the collection of all events whose occurrence can be decided by looking at the coin tosses
$2n$ and $2n + 1$. We know that $\mathcal{F}_n$ is a $\sigma$-algebra with finitely many events $\forall n \in \mathbb{N}$. It turns out that
$\{\mathcal{F}_n, n \in \mathbb{N}\}$ are independent.


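9.3 Exercises


1. (a) Let $C, D \in \mathcal{F}$, where $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$. Show that $\mathcal{F}_1 = \{\emptyset, \Omega, C, C^c\}$ and $\mathcal{F}_2 =
\{\emptyset, \Omega, D, D^c\}$ are independent iff $C$ and $D$ are independent.


(b) Let $\Omega = \{1, 2, 3, \ldots, p\}$ where $p$ is a prime, $\mathcal{F}$ be the collection of all subsets of $\Omega$, and $\mathbb{P}(A) = \frac{|A|}{p}$
(where $|A|$ denotes the cardinality of $A$) for all $A \in \mathcal{F}$. Show that, if $A$ and $B$ are independent events,
then at least one of $A$ and $B$ is either $\emptyset$ or $\Omega$.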


2. In a box, there are four red balls, six red cubes, six blue balls and an unknown number of blue cubes. 
When an object from the box is selected at random, the shape and colour of the object are independent. 
Determine the number of blue cubes. 


3. A man is known to speak the truth 3 out of 4 times. He throws a die and reports that it is a six. Find 
the probability that it is actually a six. 


4. [Exercise: Q29, Bertsekas & Tsitsiklis] Let $A$ and $B$ be events such that $\mathbb{P}(A \mid B) > \mathbb{P}(A)$. Show that
$\mathbb{P}(B \mid A) > \mathbb{P}(B)$ and $\mathbb{P}(A \mid B^c) < \mathbb{P}(A)$.


5. [MIT OCW Assignment problem] A coin is tossed independently $n$ times. The probability of heads at
each toss is $p$. At each time $k$ ($k = 2, 3, \ldots, n$) we get a reward at time $k + 1$ if the $k$th toss was a head and
the previous toss was a tail. Let $A_k$ be the event that a reward is obtained at time $k$.


a) Are events $A_k$ and $A_{k+1}$ independent?
b) Are events $A_k$ and $A_{k+2}$ independent?


6. [Assignment problem, University of Cambridge] A drawer contains two coins. One is an unbiased coin, 
which when tossed, is equally likely to turn up heads or tails. The other is a biased coin, which will 
turn up heads with probability p and tails with probability 1 — p. One coin is selected (uniformly) at 
random from the drawer. Two experiments are performed: 


a) The selected coin is tossed n times. Given that the coin turns up heads k times and tails n — k 
times, what is the probability that the coin is biased? 


b) The selected coin is tossed repeatedly until it turns up heads $k$ times. Given that the coin is tossed
$n$ times in total, what is the probability that the coin is biased?


7. [MIT OCW Assignment problem] Fred is giving out samples of dog food. He makes calls door to door,
but he leaves a sample (one can) only on those calls for which the door is answered and a dog is in
residence. On any call the probability of the door being answered is 3/4, and the probability that
any household has a dog is 2/3. Assume that the events “door answered” and “a dog lives here” are
independent and also that the outcomes of all calls are independent.


a) Determine the probability that Fred gives away his first sample on his third call.


b) Given that he has given away exactly four samples on his first eight calls, determine the conditional
probability that Fred will give away his fifth sample on his eleventh call.


c) Determine the probability that he gives away his second sample on his fifth call.


d) Given that he did not give away his second sample on his second call, determine the conditional
probability that he will leave his second sample on his fifth call.


e) We will say that Fred needs a new supply immediately after the call on which he gives away his
last can. If he starts out with two cans, determine the probability that he completes at least five
calls before he needs a new supply.


8. [MIT OCW Assignment problem] Let $A, B, A_1, A_2, \ldots$ be events. Suppose that for each $k$, we have
$A_k \subseteq A_{k+1}$, and that $A_k$ is independent of $B$, $\forall k \geq 1$. If $A = \bigcup_{k \in \mathbb{N}} A_k$, then show that $B$ is independent
of $A$.


9. [Assignment problem, University of Cambridge] Consider pairwise disjoint events $B_1, B_2, B_3$ and $C$,
with $\mathbb{P}(B_1) = \mathbb{P}(B_2) = \mathbb{P}(B_3) = p$ and $\mathbb{P}(C) = q$, where $3p + q \leq 1$. Suppose $p = -q + \sqrt{q}$; then
prove that the events $B_1 \cup C$, $B_2 \cup C$ and $B_3 \cup C$ are pairwise independent. Also, prove or disprove
that there exist $p > 0$ and $q > 0$ such that these three events are independent.


References 


[1] MIT OCW - 6.436J / 15.085J Fundamentals of Probability, Fall 2008, Lecture 2. 
Lecture 10: The Borel-Cantelli Lemmas


Lecturer: Dr. Krishna Jagannathan Scribe: Aseem Sharma 


The Borel-Cantelli lemmas are a set of results that establish whether certain events occur infinitely often or only
finitely often. We present here the two most well-known versions of the Borel-Cantelli lemmas.


Lemma 10.1 (First Borel-Cantelli lemma) Let $\{A_n\}$ be a sequence of events such that $\sum_{n=1}^{\infty} \mathbb{P}(A_n) < \infty$.
Then, almost surely, only finitely many $A_n$'s will occur.


Lemma 10.2 (Second Borel-Cantelli lemma) Let $\{A_n\}$ be a sequence of independent events such that
$\sum_{n=1}^{\infty} \mathbb{P}(A_n) = \infty$. Then, almost surely, infinitely many $A_n$'s will occur.


It should be noted that only the second lemma stipulates independence. The event “$A_n$ occurs infinitely
often ($A_n$ i.o.)” is the set of all $\omega \in \Omega$ that belong to infinitely many $A_n$'s. It is defined as
\[
\{A_n \text{ i.o.}\} \triangleq \bigcap_{n=1}^{\infty} \underbrace{\bigcup_{m=n}^{\infty} A_m}_{B_n}. \tag{10.1}
\]
Here, $B_n$ is the event that at least one of $A_n, A_{n+1}, A_{n+2}, \ldots$ occurs. Hence, $\{A_n \text{ i.o.}\}$ is the event that for
every $n \in \mathbb{N}$, there exists at least one $m \in \{n, n+1, \ldots\}$ such that $A_m$ occurs. Taking the complement of
both sides in (10.1), we get the expression for the event that $A_n$ occurs finitely often ($A_n$ f.o.):
\[
\{A_n \text{ f.o.}\} = \bigcup_{n=1}^{\infty} \bigcap_{m=n}^{\infty} A_m^c.
\]
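In order to prove the Borel-Cantelli lemmas, we require the following lemma.

Lemma 10.3 If $\sum_{i=1}^{\infty} p_i = \infty$, then $\lim_{n \to \infty} \prod_{i=1}^{n} (1 - p_i) = 0$.


Proof: Since $\ln(1 - p_i) \leq -p_i$,
\[
\prod_{i=1}^{n} (1 - p_i) = \prod_{i=1}^{n} e^{\ln(1 - p_i)} \leq \prod_{i=1}^{n} e^{-p_i} = e^{-\sum_{i=1}^{n} p_i}.
\]
Taking limits on both sides gives
\[
\lim_{n \to \infty} \prod_{i=1}^{n} (1 - p_i) \leq \lim_{n \to \infty} e^{-\sum_{i=1}^{n} p_i} = 0.
\]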
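
The inequality used in this proof can be checked numerically. The sketch below is illustrative only: it uses the hypothetical choice $p_i = 1/(i+1)$, whose sum diverges, and for which the partial product telescopes to $1/(n+1)$. The helper names are not from the notes.

```python
import math
from fractions import Fraction


def partial_product(p, n):
    """prod_{i=1}^{n} (1 - p(i)), computed exactly."""
    prod = Fraction(1)
    for i in range(1, n + 1):
        prod *= 1 - p(i)
    return prod


def exp_bound(p, n):
    """exp(-sum_{i=1}^{n} p(i)), the upper bound from the proof."""
    return math.exp(-sum(float(p(i)) for i in range(1, n + 1)))


def p(i):
    # Illustrative sequence with a divergent sum (harmonic tail).
    return Fraction(1, i + 1)


for n in (1, 10, 100, 1000):
    print(n, float(partial_product(p, n)), exp_bound(p, n))
```

Both columns decrease to zero as $n$ grows, with the product always below the exponential bound, as the lemma predicts for any sequence with a divergent sum.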
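We now proceed towards proving the Borel-Cantelli lemmas.


Proof:


1. First, note that the assumption $\sum_{n=1}^{\infty} \mathbb{P}(A_n) < \infty$ implies $\lim_{n \to \infty} \sum_{m=n}^{\infty} \mathbb{P}(A_m) = 0$. Next, since $B_{n+1} \subseteq B_n$,
we can use continuity of probability to write
\[
\mathbb{P}\Big(\bigcap_{n=1}^{\infty} \bigcup_{i=n}^{\infty} A_i\Big)
= \lim_{n \to \infty} \mathbb{P}\Big(\bigcup_{i=n}^{\infty} A_i\Big)
\leq \lim_{n \to \infty} \sum_{i=n}^{\infty} \mathbb{P}(A_i)
= 0.
\]
We have used the union bound in writing the ‘$\leq$’ above. Since $\mathbb{P}\big(\bigcap_{n=1}^{\infty} \bigcup_{i=n}^{\infty} A_i\big) \geq 0$, we conclude that
$\mathbb{P}\big(\bigcap_{n=1}^{\infty} \bigcup_{i=n}^{\infty} A_i\big) = 0$. This implies that $A_n$ occurs finitely often with probability 1.


2. The event that $A_n$ occurs finitely often ($A_n$ f.o.) is given by
\[
\{A_n \text{ f.o.}\} = \bigcup_{n=1}^{\infty} \bigcap_{i=n}^{\infty} A_i^c.
\]
Now,
\begin{align*}
\mathbb{P}\Big(\bigcup_{n=1}^{\infty} \bigcap_{i=n}^{\infty} A_i^c\Big)
&\leq \sum_{n=1}^{\infty} \mathbb{P}\Big(\bigcap_{i=n}^{\infty} A_i^c\Big) && \text{(Using union bound)} \\
&= \sum_{n=1}^{\infty} \lim_{m \to \infty} \mathbb{P}\Big(\bigcap_{i=n}^{m} A_i^c\Big) && \text{(By continuity of probability)} \\
&= \sum_{n=1}^{\infty} \lim_{m \to \infty} \prod_{i=n}^{m} \big(1 - \mathbb{P}(A_i)\big) && \text{(By independence)} \\
&= 0. && \text{(By Lemma 10.3)} \tag{10.2}
\end{align*}
Since $\mathbb{P}\big(\bigcup_{n=1}^{\infty} \bigcap_{i=n}^{\infty} A_i^c\big) \geq 0$, we conclude that $\mathbb{P}\big(\bigcup_{n=1}^{\infty} \bigcap_{i=n}^{\infty} A_i^c\big) = 0$. This implies that $A_n$ occurs infinitely
often with probability 1.

$\blacksquare$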


We now illustrate the usefulness of the Borel-Cantelli lemmas using an example. Consider an experiment in
which a coin is tossed independently many times. Let $\mathbb{P}(H_n)$ be the probability of obtaining a head at the
$n$th toss (and similarly for $T_n$).

1. Suppose $\mathbb{P}(H_n) = \frac{1}{n}$, $n \geq 1$. Then $\sum_{n=1}^{\infty} \mathbb{P}(H_n) = \infty$. By the second Borel-Cantelli lemma, it follows
that almost surely, infinitely many heads will occur. This might appear surprising at first sight, since
as $n$ becomes large, the probability of getting heads becomes vanishingly small. However, the decay
rate $1/n$ is not ‘fast enough.’ In particular, for any $n$ we choose (no matter how large), there occurs a
head beyond $n$ almost surely!


2. Suppose now that $\mathbb{P}(H_n) = \frac{1}{n^2}$. Then $\sum_{n=1}^{\infty} \mathbb{P}(H_n) < \infty$, and hence by the first Borel-Cantelli lemma,
almost surely, only finitely many heads will occur. In this case, the probability of heads decreases
fast enough that, almost surely, no heads occur beyond some finite (random) $n$. Note that independence is not
required in this case.
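
The contrast between a slowly and a quickly decaying head probability can be made concrete by computing the probability of seeing no head in a block of tosses. The sketch below is an illustration under stated assumptions: the tosses are taken to be independent in both cases (needed only for this product computation), the head probabilities are $1/n$ and, as a summable choice, $1/n^2$, and the names `prob_no_heads`, `slow` and `fast` are hypothetical.

```python
from fractions import Fraction


def prob_no_heads(p, start, stop):
    """P(no head in tosses start+1, ..., stop) for independent tosses
    in which toss n shows a head with probability p(n)."""
    prob = Fraction(1)
    for n in range(start + 1, stop + 1):
        prob *= 1 - p(n)
    return prob


def slow(n):
    return Fraction(1, n)        # head probability 1/n (sum diverges)


def fast(n):
    return Fraction(1, n * n)    # head probability 1/n^2 (sum converges)


N = 10
for M in (100, 1000, 10000):
    print(M, float(prob_no_heads(slow, N, M)), float(prob_no_heads(fast, N, M)))
```

With head probability $1/n$ the product telescopes to $N/M$, which tends to zero as $M$ grows, so a head beyond toss $N$ is almost sure. With $1/n^2$ the product equals $N(M+1)/(M(N+1))$ and stays bounded away from zero, consistent with only finitely many heads occurring.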


Exercises 


1. Suppose that a monkey sits in front of a computer and starts hammering keys randomly on the keyboard.
Show that the famous Shakespeare monologue starting “All the world's a stage” will eventually appear (with
probability 1) in its entirety on the screen, although our monkey is not particularly known for its good taste
in literature. You can make reasonable additional assumptions to form a probability model; for example,
you can assume that the monkey picks characters uniformly at random on the keyboard, and that the
successive key strokes are independent.


2. [MIT OCW problem set] Let $A_n$, $n \geq 1$, be a sequence of events such that $\mathbb{P}(A_n) \to 0$ as $n \to \infty$, and
\[
\sum_{n=1}^{\infty} \mathbb{P}(A_n^c \cap A_{n+1}) < \infty.
\]
Show that almost surely, only finitely many of the $A_n$'s will occur.


3. Online dating: On a certain day, Alice decides that she will start looking for a potential life partner on
an online dating portal. She decides that every day, she will pick a guy uniformly at random from among the
male members of the dating portal, and go out on a date with him. What Alice does not know is that her
neighbor Bob is interested in dating her. Being of a shy disposition, Bob decides that he will not ask Alice
out himself. Instead, he decides that he will go out on a date with Alice only on the days that Alice happens
to pick him from the dating portal, of which he is already a member. For the first two parts, assume that
50 new male members and 40 new female members join the dating portal every day.


(a) What is the probability that Alice and Bob would have a date on the $n$th day? Do you think Bob and
Alice would eventually stop meeting? Justify your answer, clearly stating any additional assumptions.


(b) Now suppose that Bob also picks a girl uniformly at random every day, from among the female members
of the portal, and that Alice behaves exactly as before. Assume also that Bob and Alice will meet on
a given day if and only if they both happen to pick each other. In this case, do you think Bob and
Alice would eventually stop meeting?


(c) For this part, suppose that Alice and Bob behave as in part (a), i.e., Alice picks a guy uniformly at
random, but Bob is only interested in dating Alice. However, the number of male members in the
portal increases by 1 percent every day. Do you think Bob and Alice would eventually stop meeting?
4. Let $\{S_n : n \geq 0\}$ be a simple random walk which moves to the right with probability $p$ at each step, and
suppose that $S_0 = 0$. Write $X_n = S_n - S_{n-1}$.


(a) Show that $\{S_n = 0 \text{ i.o.}\}$ is not a tail event of the sequence $\{X_n\}$.


(b) Show that $\mathbb{P}(S_n = 0 \text{ i.o.}) = 0$ if $p \neq \frac{1}{2}$.
Lecture 11: Random Variables
Lecturer: Dr. Krishna Jagannathan Scribe: Sudharsan, Gopal, Arjun B, Debayani


The study of random variables is motivated by the fact that in many scenarios, one might not be interested 
in the precise elementary outcome of a random experiment, but rather in some numerical function of the 
outcome. For example, in an experiment involving ten coin tosses, the experimenter may only want to know 
the total number of heads, rather than the precise sequence of heads and tails. 


The term random variable is a misnomer, because a random variable is neither random, nor is it a variable.
A random variable $X$ is a function from the sample space $\Omega$ to the real field $\mathbb{R}$. The term ‘random’ actually
signifies the underlying randomness in picking an element $\omega$ from the sample space $\Omega$. Once the elementary
outcome $\omega$ is fixed, the random variable takes a fixed real value, $X(\omega)$. It is important to remember that the
probability measure is associated with subsets (events), whereas a random variable is associated with each
elementary outcome $\omega$.


Just as not all subsets of the sample space are necessarily considered events, not all functions from $\Omega$
to $\mathbb{R}$ are considered random variables. In particular, a random variable is an $\mathcal{F}$-measurable function, as we
define below.


Definition 11.1 Measurable function:
Let $(\Omega, \mathcal{F})$ be a measurable space. A function $f : \Omega \to \mathbb{R}$ is said to be an $\mathcal{F}$-measurable function if the
pre-image of every Borel set is an $\mathcal{F}$-measurable subset of $\Omega$.


In the above definition, the pre-image of a Borel set $B$ under the function $f$ is given by
\[
f^{-1}(B) \triangleq \{\omega \in \Omega \mid f(\omega) \in B\}. \tag{11.1}
\]
Thus, according to the above definition, $f : \Omega \to \mathbb{R}$ is an $\mathcal{F}$-measurable function if $f^{-1}(B)$ is an $\mathcal{F}$-measurable
subset of $\Omega$ for every Borel set $B$.


Definition 11.2 Random Variable:
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. A random variable $X$ is an $\mathcal{F}$-measurable function $X : \Omega \to \mathbb{R}$.


In other words, for every Borel set $B$, its pre-image under a random variable $X$ is an event. In Figure 11.1,
$X$ is a random variable that maps every element $\omega$ in the sample space $\Omega$ to the real line $\mathbb{R}$. $B$ is a Borel
set, i.e., $B \in \mathcal{B}(\mathbb{R})$. The inverse image of $B$ is an event $E \in \mathcal{F}$.


Since the set $\{\omega \in \Omega \mid X(\omega) \in B\}$ is an event for every Borel set $B$, it has an associated probability measure.
This brings us to the concept of the probability law of the random variable $X$.


Definition 11.3 Probability law of a random variable $X$:
The probability law $\mathbb{P}_X$ of a random variable $X$ is a function $\mathbb{P}_X : \mathcal{B}(\mathbb{R}) \to [0,1]$, which is defined as
$\mathbb{P}_X(B) = \mathbb{P}(\{\omega \in \Omega \mid X(\omega) \in B\})$.


Thus, the probability law can be seen as the composition of $\mathbb{P}(\cdot)$ with the inverse image $X^{-1}(\cdot)$, i.e., $\mathbb{P}_X(\cdot) =
\mathbb{P} \circ X^{-1}(\cdot)$. Indeed, the probability law of a random variable completely specifies the statistical properties of
the random variable, as it specifies the probability of the random variable taking values in any given Borel
set.


Figure 11.1: A random variable $X : \Omega \to \mathbb{R}$. The pre-image of a Borel set $B$ is an event $E$.


Figure 11.2: The probability law $\mathbb{P}_X = \mathbb{P} \circ X^{-1}$ specifies the probability that the random variable $X$ takes
a value in some particular Borel set.


In Figure 11.2, $\mathbb{P}$ maps the event $E$ to its probability, and $\mathbb{P}_X$ maps the Borel set $B$ to the same
value, so that $\mathbb{P}_X$ is the composition of $\mathbb{P}$ with $X^{-1}$.
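
On a finite sample space, the probability law can be computed directly from its definition as the probability of a pre-image. The following Python sketch is only an illustration: it assumes three fair coin tosses with the number of heads as the random variable, and represents a Borel set by a predicate on real numbers. The names `omega`, `P`, `X`, `preimage` and `law` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space: three fair coin tosses, with the uniform measure on outcomes.
omega = list(product("HT", repeat=3))
P = {w: Fraction(1, len(omega)) for w in omega}


def X(w):
    """Random variable: number of heads in the outcome w."""
    return w.count("H")


def preimage(B):
    """Outcomes w with X(w) in B, where B is given as a predicate."""
    return [w for w in omega if B(X(w))]


def law(B):
    """Probability law of X evaluated at B: the probability of the pre-image."""
    return sum(P[w] for w in preimage(B))


p_at_most_one = law(lambda x: x <= 1)          # B = (-inf, 1]
p_in_interval = law(lambda x: 1.5 < x < 10)    # B = (1.5, 10)
print(p_at_most_one, p_in_interval)
```

The law is evaluated on sets of real numbers rather than on outcomes, which is exactly the change of viewpoint that the composition with the inverse image expresses.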


Theorem 11.4 Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $X$ be a real-valued random variable. Then, the
probability law $\mathbb{P}_X$ of $X$ is a probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.


Next, we make a short digression to introduce a mathematical structure known as a $\pi$-system (read as
pi-system).


Definition 11.5 Given a set $\Omega$, a $\pi$-system on $\Omega$ is a non-empty collection of subsets of $\Omega$ that is stable
under finite intersections. That is, $\mathcal{P}$ is a $\pi$-system on $\Omega$ if $A, B \in \mathcal{P}$ implies $A \cap B \in \mathcal{P}$.


One of the most commonly used $\pi$-systems on $\mathbb{R}$ is the class of all closed semi-infinite intervals defined as
\[
\pi(\mathbb{R}) \triangleq \{(-\infty, x] : x \in \mathbb{R}\}. \tag{11.2}
\]


Lemma 11.6 The $\sigma$-algebra generated by $\pi(\mathbb{R})$ is the Borel $\sigma$-algebra, i.e.,
$\mathcal{B}(\mathbb{R}) = \sigma(\pi(\mathbb{R}))$.


Now, we turn our attention to a key result from measure theory, which states that if two finite measures
agree on a $\pi$-system, then they also agree on the $\sigma$-algebra generated by that $\pi$-system.


Lemma 11.7 Uniqueness of extension, $\pi$-systems: Let $\Omega$ be a given set, and let $\mathcal{P}$ be a $\pi$-system
over $\Omega$. Also, let $\Sigma = \sigma(\mathcal{P})$ be the $\sigma$-algebra generated by the $\pi$-system $\mathcal{P}$. Suppose $\mu_1$ and $\mu_2$ are measures
defined on the measurable space $(\Omega, \Sigma)$ such that $\mu_1(\Omega) = \mu_2(\Omega) < \infty$ and $\mu_1 = \mu_2$ on $\mathcal{P}$. Then,
\[
\mu_1 = \mu_2 \text{ on } \Sigma.
\]


Proof: See Section A1.4 of [1]. $\blacksquare$
In particular, for probability measures, we have the following corollary:


Corollary 11.8 If two probability measures agree on a $\pi$-system, then they agree on the $\sigma$-algebra generated
by that $\pi$-system.


In particular, if two probability measures agree on $\pi(\mathbb{R})$, then they must agree on $\mathcal{B}(\mathbb{R})$. This result is of
importance to us since working with $\sigma$-algebras is difficult, whereas working with $\pi$-systems is easy!


11.1 Cumulative Distribution Function (CDF) of a Random Variable


Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $X : \Omega \to \mathbb{R}$ be a random variable. Consider the probability
space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mathbb{P}_X)$ induced on the real line by $X$. Recall that $\mathcal{B}(\mathbb{R}) = \sigma(\pi(\mathbb{R}))$ is the Borel $\sigma$-algebra whose
generating class is the collection of semi-infinite intervals (or equivalently, the open intervals). Therefore, for
any $x \in \mathbb{R}$,
\[
(-\infty, x] \in \mathcal{B}(\mathbb{R}) \implies X^{-1}((-\infty, x]) \in \mathcal{F}.
\]


It is therefore legitimate to look at the probability law $\mathbb{P}_X$ of these semi-infinite intervals. This is, by
definition, the Cumulative Distribution Function (CDF) of $X$, and is denoted by $F_X(\cdot)$.


Definition 11.9 The CDF of a random variable $X$ is defined as follows:
\[
F_X(x) \triangleq \mathbb{P}_X((-\infty, x]) = \mathbb{P}(\{\omega \mid X(\omega) \leq x\}), \quad x \in \mathbb{R}. \tag{11.3}
\]
Since the notation $\mathbb{P}(\{\omega \mid X(\omega) \leq x\})$ is a bit tedious, we will use $\mathbb{P}(X \leq x)$ although it is an abuse of
notation. Remarkably, it turns out that it is enough to specify the CDF in order to completely characterize
the probability law of the random variable! The following theorem asserts this:

Theorem 11.10 The probability law $\mathbb{P}_X$ of a random variable $X$ is uniquely specified by its CDF $F_X(\cdot)$.

Proof: This is a consequence of the uniqueness result, Lemma 11.7. Another approach is to use
Carathéodory’s extension theorem. Here, we present only an overview of the proof.
Let $\mathcal{F}_0$ denote the collection of finite unions of sets of the form $(a, b]$, where $a < b$ and $a, b \in \mathbb{R}$. Define a set
function $\mathbb{P}^0 : \mathcal{F}_0 \to [0,1]$ as $\mathbb{P}^0((a, b]) = F_X(b) - F_X(a)$. Having verified countable additivity of $\mathbb{P}^0$ on $\mathcal{F}_0$,
we can invoke Carathéodory’s theorem, thereby obtaining a measure $\mathbb{P}_X$ which uniquely extends $\mathbb{P}^0$ to $\mathcal{B}(\mathbb{R})$.


11.2 Properties of CDF


Theorem 11.11 Let $X$ be a random variable with CDF $F_X(\cdot)$. Then $F_X(\cdot)$ possesses the following properties:


1. If $x \leq y$, then $F_X(x) \leq F_X(y)$, i.e., the CDF is monotonic non-decreasing in its argument.


2. $\lim_{x \to \infty} F_X(x) = 1$ and $\lim_{x \to -\infty} F_X(x) = 0$.
3. $F_X(\cdot)$ is right-continuous, i.e., $\forall x \in \mathbb{R}$, $\lim_{\epsilon \downarrow 0} F_X(x + \epsilon) = F_X(x)$.
Proof:


1. Since, for $x \leq y$, $\{\omega \mid X(\omega) \leq x\} \subseteq \{\omega \mid X(\omega) \leq y\}$, from monotonicity of the probability measure, it
follows that $F_X(x) \leq F_X(y)$.


2. We have
$$\lim_{x \to -\infty} F_X(x) \overset{(a)}{=} \lim_{x \to -\infty} P(X \le x) \overset{(b)}{=} \lim_{n \to \infty} P(X \le x_n) \overset{(c)}{=} P\Big(\bigcap_{n \in \mathbb{N}} \{\omega : X(\omega) \le x_n\}\Big) = P(\emptyset) = 0,$$
where (a) follows from the definition of a CDF, (b) follows by considering a sequence $\{x_n\}_{n \in \mathbb{N}}$ that
decreases monotonically to $-\infty$, and (c) is a consequence of continuity of probability measures.


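Following a very similar derivation, and considering a sequence $\{x_n\}_{n \in \mathbb{N}}$ that monotonically increases
to $+\infty$, we get:
$$\lim_{x \to \infty} F_X(x) = \lim_{x \to \infty} P(X \le x) = \lim_{n \to \infty} P(X \le x_n) = P\Big(\bigcup_{n \in \mathbb{N}} \{\omega : X(\omega) \le x_n\}\Big) = P(\Omega) = 1.$$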


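3. Consider a sequence $\{\epsilon_n\}_{n \in \mathbb{N}}$ decreasing to zero. Therefore, for each $x \in \mathbb{R}$,
$$\lim_{\epsilon \downarrow 0} F_X(x + \epsilon) \overset{(a)}{=} \lim_{\epsilon \downarrow 0} P(X \le x + \epsilon) = \lim_{n \to \infty} P(X \le x + \epsilon_n) \overset{(b)}{=} P\Big(\bigcap_{n \in \mathbb{N}} \{\omega : X(\omega) \le x + \epsilon_n\}\Big) = P(X \le x) = F_X(x),$$
where (a) follows from the definition of CDF, and (b) follows from continuity of probability measures.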


Note that, in general, a CDF need not be continuous. But right-continuity must necessarily be satisfied by 
any CDF. It turns out that not only are the above three properties satisfied by all CDFs, but any function 
that satisfies these properties is necessarily a CDF of some random variable! 


Theorem 11.12 Let $F$ be a function satisfying the three properties of a CDF as in Theorem 11.11.
Consider the probability space $(\Omega, \mathcal{F}, P) = ([0, 1), \mathcal{B}([0,1)), \lambda)$. Then, there exists a random variable $X : \Omega \to \mathbb{R}$ whose
CDF is $F$.


A constructive proof can be found in [3]. 
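
To make Theorem 11.12 concrete, here is a minimal sketch of the usual quantile-transform idea: draw ω uniformly from [0, 1) and set X(ω) = inf{x : F(x) ≥ ω}. Whether the proof in [3] takes exactly this form is an assumption on our part; the exponential CDF, the bisection bounds, and the names `exp_cdf` and `quantile` are illustrative choices, not part of the notes.

```python
import math
import random

def exp_cdf(x, lam=2.0):
    """An example target CDF F: exponential with (assumed) rate lam."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def quantile(F, w, lo=-1e6, hi=1e6, iters=100):
    """Approximate inf{x : F(x) >= w} by bisection.

    Relies only on F being non-decreasing and right-continuous;
    the search interval [lo, hi] is an assumption of this sketch.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if F(mid) >= w:
            hi = mid
        else:
            lo = mid
    return hi

rng = random.Random(0)
# omega is uniform on [0, 1), i.e. drawn according to Lebesgue measure.
samples = [quantile(exp_cdf, rng.random()) for _ in range(5000)]

x0 = 0.5
empirical = sum(s <= x0 for s in samples) / len(samples)
print(empirical, exp_cdf(x0))  # the two values should be close
```

The empirical fraction of samples below `x0` approximates F(x0), which is what the theorem promises for the constructed random variable.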
Lecture 11: Random Variables: Types and CDF


Lecturer: Dr. Krishna Jagannathan Scribe: Sudharsan, Gopal, Arjun B, Debayani 


In this lecture, we will focus on the types of random variables. Random variables are categorized into various 
types, depending on the nature of the measure $P_X$ induced on the real line (or to be more precise, on the
Borel σ-algebra). Indeed, there are three fundamentally different types of measures possible on the real
line. According to an important theorem in measure theory, called the Lebesgue decomposition theorem (see 
Theorem 12.1.1 of [2]), any probability measure on R can be uniquely decomposed into a sum of these three 
types of measures. The three fundamental types of measure are 


• Discrete,
• Continuous, and
• Singular.

In other words, there are three ‘pure type’ random variables, namely discrete random variables, continuous
random variables, and singular random variables. It is also possible to ‘mix and match’ these three types to
get four kinds of mixed random variables, altogether resulting in seven types of random variables.


Of the three fundamental types of random variables, only the discrete and continuous random variables are 
important for practical applications in the field of engineering and statistics. Singular random variables are 
largely of academic interest. Therefore, we will spend most of our effort in studying discrete and continuous 
random variables, although we will define and give an example of a singular random variable. 


11.1 Discrete Random Variables 


Definition 11.1 Discrete Random Variable: 
A random variable X is said to be discrete if it takes values in a countable subset of R with probability 1. 


Thus, there is a countable set $E = \{x_1, x_2, \ldots\}$ such that $P_X(E) = 1$. Note that the definition does not
necessarily demand that the range of the random variable is countable. In particular, for a discrete random
variable, there might exist some zero probability subset of the sample space, which can potentially map to
an uncountable subset of $\mathbb{R}$. (Can you think of such an example?)


Definition 11.2 Probability Mass Function (PMF): 
If $X$ is a discrete random variable, the function $p_X : \mathbb{R} \to [0,1]$ defined by $p_X(x) = P(X = x)$ for every $x$ is
called the probability mass function of X. 


Although the PMF is defined for all $x \in \mathbb{R}$, it is clear from the definition that the PMF is non-zero only on
the set $E$. Also, since $P_X(E) = 1$, we must have (by countable additivity)
$$\sum_{i=1}^{\infty} p_X(x_i) = 1.$$

Figure 11.1: CDF of a discrete random variable


Interestingly, for a discrete random variable X, the PMF is enough to get a complete characterization of the 
probability law Px. Indeed, for any Borel set B, we can write 


$$P_X(B) = \sum_{i :\, x_i \in B} P(X = x_i).$$


The CDF of a discrete random variable is given by
$$F_X(x) = \sum_{i :\, x_i \le x} P(X = x_i).$$

Figure 11.1 represents the Cumulative Distribution Function of a discrete random variable. One can observe
that the CDF plotted in Figure 11.1 satisfies all the properties discussed earlier.
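
As a small illustration of how the PMF determines both the CDF and the law of a discrete random variable, the following sketch uses a hypothetical PMF on E = {1, 2, 3}. The probabilities are made up for the example and are not those of the figure; `cdf` and `law` are illustrative names.

```python
from fractions import Fraction

# Hypothetical PMF on E = {1, 2, 3}; the numbers are illustrative only.
pmf = {1: Fraction(1, 5), 2: Fraction(1, 2), 3: Fraction(3, 10)}

def cdf(x):
    """F_X(x): sum of p_X(x_i) over all x_i <= x."""
    return sum((p for xi, p in pmf.items() if xi <= x), Fraction(0))

def law(indicator):
    """P_X(B) for a set B described by a membership test on the real line."""
    return sum((p for xi, p in pmf.items() if indicator(xi)), Fraction(0))

# The CDF is a staircase: flat between atoms, jumping by p_X(x_i) at x_i.
for x in (0, 1, 1.5, 2, 3, 10):
    print(x, cdf(x))

# P_X((1, 3]) can be read off either from the PMF or from the CDF.
print(law(lambda x: 1 < x <= 3), cdf(3) - cdf(1))
```

Exact fractions are used so that the jumps and limits of the staircase can be compared without rounding error.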


Next, we give some examples of some frequently encountered discrete random variables. 


1. Indicator random variable: Let $(\Omega, \mathcal{F}, P)$ be a probability space, and let $A \in \mathcal{F}$ be any event. Define
$$I_A(\omega) = \begin{cases} 1, & \omega \in A, \\ 0, & \omega \notin A. \end{cases}$$
It can be verified that $I_A$ is indeed a random variable (since $A$ and $A^c$ are $\mathcal{F}$-measurable), and it is
clearly discrete, since it takes only two values.


2. Bernoulli random variable: Let $p \in [0,1]$, and define $p_X(0) = p$ and $p_X(1) = 1 - p$. This random
variable can be used to model a single coin toss, where 0 denotes a head and 1 denotes a tail, and the
probability of heads is $p$. The case $p = 1/2$ corresponds to a fair coin toss.


3. Discrete uniform random variable: Parameters are integers $a$ and $b$ where $a < b$.
$p_X(m) = 1/(b - a + 1)$ for $m = a, a+1, \ldots, b$, and $p_X(m) = 0$ otherwise.
4. Binomial random variable: $p_X(k) = \binom{n}{k} p^k (1-p)^{n-k}$ for $k = 0, 1, \ldots, n$, where $n \in \mathbb{N}$ and $p \in (0, 1]$. In the coin toss
example, a binomial random variable can be used to model the number of heads observed in $n$ independent
tosses, where $p$ is the probability of head appearing during each trial.


5. Geometric random variable: $p_X(k) = p(1 - p)^{k-1}$, $k = 1, 2, \ldots$ and $0 < p \le 1$. A geometric random
variable with parameter $p$ represents the number of (independent) tosses of a coin until heads is observed
for the first time, where $p$ represents the probability of heads during each toss.


6. Poisson: Fix the parameter $\lambda > 0$, and define $p_X(k) = \frac{e^{-\lambda} \lambda^k}{k!}$, where $k = 0, 1, \ldots$


Note that except for the indicator random variable, we have described only the PMFs of the random variables,
rather than the explicit mapping from $\Omega$.
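
The PMFs listed above can be checked numerically against the normalization requirement (the probabilities over the countable set E must sum to 1). The sketch below is only a sanity check; the parameter values and the truncation points of the infinite sums are arbitrary assumptions.

```python
import math

def binomial_pmf(k, n, p):
    """p_X(k) = C(n, k) p^k (1 - p)^(n - k), for k = 0, ..., n."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def geometric_pmf(k, p):
    """p_X(k) = p (1 - p)^(k - 1), for k = 1, 2, ..."""
    return p * (1 - p)**(k - 1)

def poisson_pmf(k, lam):
    """p_X(k) = exp(-lam) lam^k / k!, for k = 0, 1, ..."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n, p, lam = 10, 0.3, 2.5  # illustrative parameter values

binom_total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
# The infinite sums are truncated where the remaining tail is negligible.
geom_total = sum(geometric_pmf(k, p) for k in range(1, 200))
pois_total = sum(poisson_pmf(k, lam) for k in range(0, 60))

print(binom_total, geom_total, pois_total)  # each should be close to 1
```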


11.2 Continuous Random Variables 


11.2.1 Definitions 


Let us begin with the definition of absolute continuity which will allow us to define continuous random
variables formally. Let $\mu$ and $\nu$ be measures on $(\Omega, \mathcal{F})$.


Definition 11.3 We say $\nu$ is absolutely continuous with respect to $\mu$ if for every $N \in \mathcal{F}$ such that $\mu(N) = 0$,
we have $\nu(N) = 0$.


Now, let $(\Omega, \mathcal{F}, P)$ be a probability space and $X : \Omega \to \mathbb{R}$ a random variable.


Definition 11.4 $X$ is said to be a continuous random variable if the law $P_X$ is absolutely continuous with
respect to the Lebesgue measure $\lambda$.


Here, both $P_X$ and $\lambda$ are measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. The above definition says that $X$ is a continuous random
variable if for any Borel set $N$ of Lebesgue measure zero, we have $P_X(N) = P(\{\omega \mid X(\omega) \in N\}) = 0$.


In particular, taking values in an uncountable set does not, by itself, make a random variable continuous.


Next, we invoke without proof a special case of the Radon-Nikodym Theorem [3], which deals with absolutely 
continuous measures. 


Theorem 11.5 Suppose $P_X$ is absolutely continuous with respect to $\lambda$, the Lebesgue measure. Then there
exists a non-negative, measurable function $f_X : \mathbb{R} \to [0, \infty)$, such that for any $B \in \mathcal{B}(\mathbb{R})$, we have
$$P_X(B) = \int_B f_X \, d\lambda. \qquad (11.1)$$


The integral in the above theorem is not the usual Riemann integral, as $B$ may be any Borel measurable
set, such as the Cantor set, for example. We will get a precise understanding of the integral in (11.1) when
we study abstract integration later in the course. For the time being, we can just think of the set $B$ as an
interval $[a,b]$, so (11.1) essentially says that the probability of $X$ taking values in the interval $[a,b]$ can be
written as $\int_a^b f_X(x)\,dx$ for some non-negative measurable function $f_X$. Here, when we say $f_X$ is measurable,
we mean the pre-images of Borel sets are also Borel sets. In measure theoretic parlance, $f_X$ is called the
Radon-Nikodym derivative of $P_X$ with respect to the Lebesgue measure $\lambda$.
In particular, taking $B = (-\infty, x]$, we can write the cumulative distribution function (CDF) as
$$F_X(x) \triangleq P_X((-\infty, x]) = \int_{-\infty}^{x} f_X(y)\, dy. \qquad (11.2)$$

Thus, we can understand $f_X$ as the probability density function (PDF) of $X$, which is nothing but the
Radon-Nikodym derivative of $P_X$ with respect to the Lebesgue measure $\lambda$. Also,
$$P_X(\mathbb{R}) = 1 = \int_{-\infty}^{\infty} f_X(y)\, dy.$$


Unlike the probability mass function in the case of a discrete random variable, the PDF has no interpretation 
as a probability; only integrals of the PDF can be interpreted as a probability. 


The function $f_X$ is unique only up to a set of Lebesgue measure zero, as we will understand later. We also
remark that many authors (including [4]) define a random variable as being continuous if the CDF satisfies
(11.2). This definition can be shown to be equivalent to the one we have given above.


11.2.2 Examples 
The following are some common examples of continuous random variables: 


1. Uniform: It is a scaled Lebesgue measure on a closed interval [a, b]. 


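(a) PDF:
$$f_X(x) = \begin{cases} 0 & \text{for } x < a, \\ \frac{1}{b-a} & \text{for } a \le x \le b, \\ 0 & \text{for } x > b. \end{cases}$$

Figure 11.2: The PDF of a uniform random variable

(b) CDF:
$$F_X(x) = \begin{cases} 0 & \text{for } x < a, \\ \frac{x-a}{b-a} & \text{for } a \le x \le b, \\ 1 & \text{for } x > b. \end{cases}$$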


2. Exponential: It is a non-negative random variable, characterized by a single parameter $\lambda > 0$.


(a) PDF: $f_X(x) = \lambda e^{-\lambda x}$ for $x \ge 0$.

(b) CDF: $F_X(x) = 1 - e^{-\lambda x}$ for $x \ge 0$.

(c) The exponential random variable possesses an interesting property called the ‘memoryless’ property.
We first give the definition of the memoryless property, and then show that the exponential
random variable has this property.


Figure 11.3: The CDF of a uniform random variable 


Figure 11.4: The PDF of an exponential random variable, for various values of the parameter $\lambda$


Definition 11.6 A non-negative random variable $X$ is said to be memoryless if
$P(X > s + t \mid X > t) = P(X > s)$, $\forall s, t \ge 0$.


For an exponential random variable,
$$P(X > s+t \mid X > t) = \frac{P(\{X > s+t\} \cap \{X > t\})}{P(X > t)} = \frac{P(X > s+t)}{P(X > t)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda t}} = e^{-\lambda s} = P(X > s).$$


Therefore, the exponential random variable is memoryless. For example, if the failure time of a 
light bulb is distributed exponentially, then the further time to failure, given that the bulb has not 
failed until time t, has the same distribution as the unconditional failure time of a new light bulb! 
Interestingly, it can also be shown that the exponential random variable is the only continuous 
random variable which possesses the memoryless property. 
Figure 11.5: The CDF of an exponential random variable, for various values of the parameter $\lambda$


3. Gaussian (or Normal): This is a two parameter distribution, and as we shall interpret later, these
parameters are the mean $\mu \in \mathbb{R}$ and standard deviation $\sigma > 0$. It has wide applications in engineering
and statistics, owing to a ‘stable-attractor’ property of Gaussian random variables. We will study
these properties later.


(a) PDF: The probability density function of a Gaussian random variable is given by
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \quad \text{for } x \in \mathbb{R}.$$
The above distribution is denoted $\mathcal{N}(\mu, \sigma^2)$. In particular, when $\mu = 0$ and $\sigma^2 = 1$, we get the
standard Gaussian PDF: $f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$.


Figure 11.6: The PDF of a normal random variable, for various parameters
(b) CDF: There is no closed-form expression for the CDF of a Gaussian distribution (although the
notion of a ‘closed-form’ is itself rather arbitrary, and over-rated!). For convenience, we call the
CDF of the standard Gaussian the “error-function”:
$$\mathrm{Erf}(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy.$$


[Plot legend: $\mu = 0, \sigma^2 = 0.2$; $\mu = 0, \sigma^2 = 1$; $\mu = 0, \sigma^2 = 5$; $\mu = -2, \sigma^2 = 0.5$]
Figure 11.7: The CDF of a normal random variable, for various parameters


4. Cauchy: This is a two-parameter distribution parametrised by $x_0 \in \mathbb{R}$, the centering parameter, and
$\gamma > 0$, the scale parameter. It is qualitatively very different from the previous distributions, because it
is “heavy-tailed,” i.e., its complementary CDF $1 - F_X(x)$ decays slower than any exponential.
Heavy-tailed random variables tend to take very large values with non-negligible probability, and are used to
model high variability and burstiness in engineering applications.


(a) PDF: $f_X(x) = \frac{1}{\pi} \cdot \frac{\gamma}{(x - x_0)^2 + \gamma^2}$, for $x \in \mathbb{R}$.


Figure 11.8: The PDF of a Cauchy random variable, for different parameters
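
The continuous examples above lend themselves to a few quick numerical checks. The sketch below (i) verifies the memoryless identity for the exponential tail, (ii) evaluates the standard Gaussian CDF, and (iii) compares a Cauchy tail with an exponential tail. Two points are assumptions of this sketch rather than statements from the notes: the function the notes call "Erf" is the standard normal CDF, which relates to the conventional error function `math.erf` via Φ(x) = (1 + erf(x/√2))/2, and the Cauchy CDF is taken in its standard form 1/2 + arctan((x − x0)/γ)/π. All parameter values are illustrative.

```python
import math

def exp_tail(x, lam):
    """P(X > x) = exp(-lam * x) for an exponential X and x >= 0."""
    return math.exp(-lam * x)

def conditional_tail(s, t, lam):
    """P(X > s + t | X > t) = P(X > s + t) / P(X > t)."""
    return exp_tail(s + t, lam) / exp_tail(t, lam)

def std_normal_cdf(x):
    """Standard Gaussian CDF expressed through the conventional erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_cdf_numeric(x, lower=-10.0, steps=20000):
    """Trapezoidal approximation of the integral of the standard Gaussian PDF."""
    h = (x - lower) / steps
    pdf = lambda y: math.exp(-y * y / 2.0) / math.sqrt(2.0 * math.pi)
    total = 0.5 * (pdf(lower) + pdf(x))
    total += sum(pdf(lower + i * h) for i in range(1, steps))
    return total * h

def cauchy_tail(x, x0=0.0, gamma=1.0):
    """Complementary CDF of the Cauchy distribution."""
    return 0.5 - math.atan((x - x0) / gamma) / math.pi

lam = 0.7
memoryless_gaps = [abs(conditional_tail(s, t, lam) - exp_tail(s, lam))
                   for s in (0.5, 1.0, 3.0) for t in (0.1, 2.0, 10.0)]
print(max(memoryless_gaps))                                # numerically zero
print(std_normal_cdf(1.0), std_normal_cdf_numeric(1.0))    # should agree closely
print(cauchy_tail(50.0), exp_tail(50.0, 1.0))              # Cauchy tail is far heavier
```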
11.3 Singular Random Variable


Singular random variables are rather bizarre, and in some sense, they occupy the ‘middle-ground’ between
discrete and continuous random variables. In particular, singular random variables take values with
probability one on an uncountable set of Lebesgue measure zero!


Definition 11.7 A random variable $X$ is said to be singular if, for every $x \in \mathbb{R}$, we have $P_X(\{x\}) = 0$, and
there exists a zero Lebesgue measure set $F \in \mathcal{B}(\mathbb{R})$, such that $P_X(F) = 1$.


Although it is not stated explicitly in the definition, it is clear that F must be an uncountable set of Lebesgue 
measure zero. (Why?) 


Example A random variable having the Cantor distribution as its CDF is an example of a singular
random variable. The range of this random variable is the Cantor set, $C$, which is a Borel set with Lebesgue
measure zero. Further, if $x \in C$, then $x$ has a ternary expansion of the following form:
$$x = \sum_{i=1}^{\infty} \frac{x_i}{3^i}, \quad \text{where } x_i \in \{0, 2\}. \qquad (11.3)$$




Figure 11.9: The Cantor Function 


To look at a concrete example, consider an infinite sequence of independent tosses of a fair coin. When
the outcome is a head, we record $x_i = 2$; otherwise, we record $x_i = 0$. Using these values of $x_i$, we form
a number $x$ using (11.3). This results in a random variable $X$. This random variable satisfies the two
properties that make it a singular random variable, namely $P_X(C) = 1$ and $P_X(\{x\}) = 0$, $\forall x \in [0,1]$.
The cumulative distribution function of this random variable, shown in Figure 11.9, is the Cantor function
(which is sometimes referred to as the Devil’s staircase). The Cantor function is continuous everywhere,
since all singletons have zero probability under this distribution. Also, the derivative is zero wherever it
exists, and the derivative does not exist at points in the Cantor set. The CDF only increases at these Cantor
points, but does so without a well defined derivative, or any jump discontinuities for that matter!
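
The coin-toss construction can be simulated directly. The sketch below truncates the ternary expansion after a finite number of digits (an approximation; the true random variable uses infinitely many tosses) and compares the empirical CDF with a digit-based evaluation of the Cantor function. The function names and the truncation depth are illustrative assumptions.

```python
import random

def cantor_sample(rng, n_digits=30):
    """Draw x = sum_i x_i / 3^i with x_i = 2 on heads and 0 on tails (truncated)."""
    return sum((2 if rng.random() < 0.5 else 0) / 3**i
               for i in range(1, n_digits + 1))

def cantor_function(x, n_digits=30):
    """Approximate the Cantor function on [0, 1] from the ternary digits of x.

    Digits 0/2 are mapped to binary digits 0/1; the expansion stops at the
    first ternary digit equal to 1, which contributes a final binary 1.
    """
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    value, scale = 0.0, 0.5
    for _ in range(n_digits):
        x *= 3
        digit = int(x)
        x -= digit
        if digit == 1:
            return value + scale
        value += scale * (digit // 2)
        scale /= 2
    return value

rng = random.Random(0)
samples = [cantor_sample(rng) for _ in range(4000)]

# No sample lands in the removed middle third (1/3, 2/3).
print(any(1/3 < s < 2/3 for s in samples))
# The empirical CDF at 0.5 should be close to the Cantor function there.
print(sum(s <= 0.5 for s in samples) / len(samples), cantor_function(0.5))
```

The plateau of the CDF over the middle third reflects the fact that the law puts no mass outside the Cantor set.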


11.4 Exercises: 


1. (a) Prove Theorem 15.4. 
(b) Verify that $\pi(\mathbb{R})$, defined in the lecture on Random Variables, is indeed a π-system over $\mathbb{R}$.
(c) Prove Lemma 15.6. 
(d) Plot the CDF of the indicator random variable. 


2. For a random variable $X$, prove that $P_X(\{y\}) = F_X(y) - \lim_{x \uparrow y} F_X(x)$. Hence show that $F_X$ is
continuous at $y$ if and only if $P_X(\{y\}) = 0$.


3. Among the functions given below, find the functions that are valid CDFs and find their respective 
densities. For those that are not valid CDFs, explain what fails. 


(a) 
Fe) ={ Ce et (11.4) 


(b) p 
F(z) ={ e 5 ae 7 (11.5) 


(c) 
(11.6) 


paal 

= 

II 
Role © 
SO 
V AIA 
neS o 

IA 

Nie 


4. Negative Binomial Random Variable. Consider a sequence of independent Bernoulli trials $\{X_i\}_{i \in \mathbb{N}}$
with parameter of success $p \in (0,1]$. The number of successes in the first $n$ trials is given by
$$Y_n = \sum_{i=1}^{n} X_i.$$
$Y_n$ is distributed as Binomial with parameters $n$ and $p$.
Consider the random variable defined by
$$V_k = \min\{n \in \mathbb{N}_+ : Y_n = k\}.$$
Note that $V_1$ is distributed as Geometric with parameter $p$.


(a) Give a verbal description of the random variable $V_k$.
(b) Show that the probability mass function of the random variable $V_k$ is given by
$$P(V_k = n) = \binom{n-1}{k-1} p^k (1-p)^{n-k},$$
where $n \in \{k, k+1, \ldots\}$. This is known as the Negative Binomial Distribution with parameters $k$ and
$p$.
(c) Argue that Binomial and Negative Binomial Distributions are inverse to each other in the sense 


that 


5. Radioactive decay. Assume that a radioactive sample emits a random number of α particles in
any given hour, and that the number of α particles emitted in an hour is Poisson distributed with
parameter $\lambda$. Suppose that a faulty Geiger-Muller counter is used to count these particle emissions.
In particular, the faulty counter fails to register an emission with probability p, independently of other 
emissions. 


(a) What is the probability that the faulty counter will register exactly k emissions in an hour? 
(b) Given that the faulty counter registered k emissions in an hour, what is the PMF of the actual
number of emissions that happened from the source during that hour? 


6. Buses arrive at ten minute intervals starting at noon. A man arrives at the bus stop at a random time 
X minutes after noon, where X has the CDF: 


$$F_X(x) = \begin{cases} 0, & x < 0, \\ \frac{x}{60}, & 0 \le x < 60, \\ 1, & x \ge 60. \end{cases} \qquad (11.7)$$


What is the probability that he waits less than five minutes for a bus? 


7. Find the values of a and b such that the following function is a valid CDF: 


$$F(x) = \begin{cases} 1 - a e^{-x/b}, & x \ge 0, \\ 0, & x < 0. \end{cases} \qquad (11.8)$$


Also, find the values of a and b such that the function above corresponds to the CDF of some 


(a) Continuous Random Variable 
(b) Discrete Random Variable 
(c) Mixed type Random Variable 


8. Let X be a continuous random variable. Show that X is memoryless iff X is an exponential random 
variable. 
Lecture 12: Multiple Random Variables and Independence
Instructor: Dr. Krishna Jagannathan Scribes: Debayani Ghosh, Gopal Krishna Kamath M, Ravi Kolla 


12.1 Multiple Random Variables 


In this lecture, we consider multiple random variables defined on the same probability space. To begin with,
let us consider two random variables $X$ and $Y$, defined on the probability space $(\Omega, \mathcal{F}, P)$. It is important to
understand that the realizations of $X$ and $Y$ are governed by the same underlying randomness, namely $\omega \in \Omega$.
For example, the underlying sample space could be something as complex as the weather on a particular
day; the random variable $X$ could denote the temperature on that day, and another random variable $Y$, the
humidity level. Since the same underlying outcome governs both $X$ and $Y$, it is reasonable to expect $X$ and
$Y$ to possess a certain degree of interdependence. In the above example, a high temperature on a given day
usually says something about the humidity.


In Figure (12.1), the top picture shows two random variables $X$ and $Y$, each mapping $\Omega$ to $\mathbb{R}$. These two
random variables are measurable functions from the same probability space to the real line. The bottom
picture in Figure (12.1) shows $(X(\cdot), Y(\cdot))$ mapping $\Omega$ to $\mathbb{R}^2$. Indeed, the bottom picture is more meaningful,
since it captures the interdependence between $X$ and $Y$.


Now, an important question arises: is the function $(X(\cdot), Y(\cdot)) : \Omega \to \mathbb{R}^2$ measurable, given that $X$
and $Y$ are measurable functions? In order to pose this question properly and answer it, we first need
to define the Borel σ-algebra on $\mathbb{R}^2$. The Borel σ-algebra on $\mathbb{R}^2$ is the σ-algebra generated by the class
$\pi(\mathbb{R}^2) \triangleq \{(-\infty, x] \times (-\infty, y] \mid x, y \in \mathbb{R}\}$. That is,
$$\mathcal{B}(\mathbb{R}^2) = \sigma(\pi(\mathbb{R}^2)).$$


The following theorem asserts that whenever $X$ and $Y$ are random variables, the function $(X, Y) : \Omega \to \mathbb{R}^2$
is $\mathcal{F}$-measurable, in the sense that the pre-images of Borel sets on $\mathbb{R}^2$ are necessarily events.


Theorem 12.1 Let $X$ and $Y$ be two random variables on $(\Omega, \mathcal{F}, P)$. Then, $(X(\cdot), Y(\cdot)) : \Omega \to \mathbb{R}^2$ is
$\mathcal{F}$-measurable, i.e., the pre-images of Borel sets on $\mathbb{R}^2$ under $(X(\cdot), Y(\cdot))$ are events.


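Proof: Let $\mathcal{G}$ be the collection of all subsets of $\mathbb{R}^2$ whose pre-images under $(X(\cdot), Y(\cdot))$ are events. To prove
the theorem, it is enough to prove that $\mathcal{B}(\mathbb{R}^2) \subseteq \mathcal{G}$.


Claim 1: $\mathcal{G}$ is a σ-algebra of subsets of $\mathbb{R}^2$.


Next, note that $\{\omega \mid X(\omega) \le x\}, \{\omega \mid Y(\omega) \le y\} \in \mathcal{F}$, $\forall x, y \in \mathbb{R}$, since $X$ and $Y$ are random variables. Thus,
$\{\omega \mid X(\omega) \le x\} \cap \{\omega \mid Y(\omega) \le y\} \in \mathcal{F}$, $\forall x, y \in \mathbb{R}$, since $\mathcal{F}$ is a σ-algebra.
So, $\{\omega \mid X(\omega) \le x, Y(\omega) \le y\} \in \mathcal{F}$, $\forall x, y \in \mathbb{R}$ $\implies (-\infty, x] \times (-\infty, y] \in \mathcal{G}$, $\forall x, y \in \mathbb{R}$ (from the definition of $\mathcal{G}$)
$\implies \pi(\mathbb{R}^2) \subseteq \mathcal{G} \implies \sigma(\pi(\mathbb{R}^2)) \subseteq \sigma(\mathcal{G}) = \mathcal{G}$ (since $\mathcal{G}$ is a σ-algebra by Claim 1) $\implies \mathcal{B}(\mathbb{R}^2) \subseteq \mathcal{G}$. ∎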


Since the pre-images of Borel sets on R? are events, we can assign probabilities to them. This leads us to 
the definition of the joint probability law. 


Figure 12.1: Illustration of Multiple Random Variables 


Definition 12.2 The joint probability law of the random variables $X$ and $Y$ is defined as:
$$P_{X,Y}(B) = P(\{\omega \in \Omega \mid (X(\omega), Y(\omega)) \in B\}), \quad B \in \mathcal{B}(\mathbb{R}^2),$$
where $\mathcal{B}(\mathbb{R}^2)$ is the Borel σ-algebra on $\mathbb{R}^2$.
In particular, when $B = (-\infty, x] \times (-\infty, y]$, we have
$$P_{X,Y}((-\infty, x] \times (-\infty, y]) = P(\{\omega \mid X(\omega) \le x, Y(\omega) \le y\}). \qquad (12.1)$$


The LHS in (12.1) is well defined, and hence the RHS in (12.1) is well defined and is called the joint CDF
of $X$ and $Y$.


12.2 Joint CDF 


Definition 12.3 Let $X$ and $Y$ be two random variables defined on the probability space $(\Omega, \mathcal{F}, P)$. The joint
CDF of $X$ and $Y$ is defined as follows:
$$F_{X,Y}(x, y) = P(\{\omega \mid X(\omega) \le x, Y(\omega) \le y\}), \quad \forall x, y \in \mathbb{R}.$$

In shorthand, we write $F_{X,Y}(x, y) = P(X \le x, Y \le y)$.
12.2.1 Properties of joint CDF:


1. $\lim_{x \to \infty,\, y \to \infty} F_{X,Y}(x, y) = 1$ and $\lim_{x \to -\infty,\, y \to -\infty} F_{X,Y}(x, y) = 0$.
Proof: Let $\{x_n\}$ and $\{y_n\}$ be two unbounded, monotone-increasing sequences. We have
$$\lim_{x \to \infty,\, y \to \infty} F_{X,Y}(x, y) = \lim_{x \to \infty,\, y \to \infty} P(X \le x, Y \le y) = \lim_{n \to \infty} P(X \le x_n, Y \le y_n) \overset{(a)}{=} P\Big(\bigcup_{n=1}^{\infty} \{\omega : X(\omega) \le x_n, Y(\omega) \le y_n\}\Big) = P(\Omega) = 1,$$
where (a) is due to continuity of probability measures (Lecture #5, Property 6). The proof of the other
part follows along similar lines and is left as an exercise to the reader.
Note that the order of the two limits does not matter here. ∎


2. Monotonicity: For any $x_1 \le x_2$, $y_1 \le y_2$, $F_{X,Y}(x_1, y_1) \le F_{X,Y}(x_2, y_2)$.
Proof: Let $x_1 \le x_2$ and $y_1 \le y_2$. Clearly, $\{X \le x_1, Y \le y_1\} \subseteq \{X \le x_2, Y \le y_2\}$. Then,
$P(X \le x_1, Y \le y_1) \le P(X \le x_2, Y \le y_2) \implies F_{X,Y}(x_1, y_1) \le F_{X,Y}(x_2, y_2)$. ∎


3. $F_{X,Y}$ is continuous from above, i.e., $\lim_{u \to 0^+,\, v \to 0^+} F_{X,Y}(x + u, y + v) = F_{X,Y}(x, y)$, $\forall x, y \in \mathbb{R}$.
Exercise: Prove this.


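4. $\lim_{y \to \infty} F_{X,Y}(x, y) = F_X(x)$.


Proof: Let $\{y_n\}$ be an unbounded, monotone-increasing sequence. Then, $\lim_{y \to \infty} F_{X,Y}(x, y) = \lim_{n \to \infty} F_{X,Y}(x, y_n)$.
Hence,
$$\lim_{n \to \infty} F_{X,Y}(x, y_n) = \lim_{n \to \infty} P(\{\omega : X(\omega) \le x, Y(\omega) \le y_n\}) \overset{(a)}{=} P\Big(\bigcup_{n=1}^{\infty} \{\omega : X(\omega) \le x, Y(\omega) \le y_n\}\Big) = P(\{\omega : X(\omega) \le x\}) = F_X(x),$$
where (a) is due to continuity of probability measures (Lecture #5, Property 6). ∎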


Using the above property, we can calculate the marginal CDFs from joint CDF. However, the joint CDF 
cannot be obtained from the marginals alone, since the marginals do not capture the inter-dependence 
of X and Y. 
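
To see concretely why the marginals do not determine the joint CDF, consider two hypothetical joint PMFs on {0, 1} × {0, 1}: one in which X and Y are independent fair bits, and one in which Y = X. Both are illustrative constructions, not examples from the notes; they share the same marginals but have different joint CDFs.

```python
from fractions import Fraction

half, quarter = Fraction(1, 2), Fraction(1, 4)

# Two hypothetical joint PMFs with identical marginals (fair bits).
independent = {(0, 0): quarter, (0, 1): quarter, (1, 0): quarter, (1, 1): quarter}
identical = {(0, 0): half, (1, 1): half}  # Y = X

INF = float("inf")

def joint_cdf(pmf, x, y):
    """F_{X,Y}(x, y): sum of the joint PMF over atoms (a, b) with a <= x, b <= y."""
    return sum((p for (a, b), p in pmf.items() if a <= x and b <= y), Fraction(0))

def marginal_cdf_x(pmf, x):
    """F_X(x), obtained by letting y grow without bound (property 4)."""
    return joint_cdf(pmf, x, INF)

def marginal_cdf_y(pmf, y):
    """F_Y(y), obtained by letting x grow without bound."""
    return joint_cdf(pmf, INF, y)

for t in (-1, 0, 0.5, 1):
    print(t, marginal_cdf_x(independent, t), marginal_cdf_x(identical, t))

# Same marginals, different joint CDFs:
print(joint_cdf(independent, 0, 0), joint_cdf(identical, 0, 0))
```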


12.3 The σ-algebra generated by a random variable


Before we proceed to define the independence of random variables, it is useful to understand the notion of
the σ-algebra generated by a random variable. We first state an elementary result that holds for any arbitrary
function.
Proposition 12.4 Let $\Omega$ and $S$ be two non-empty sets and let $f : \Omega \to S$ be a function. If $\mathcal{H}$ is a σ-algebra
of subsets of $S$, then $\mathcal{G} \triangleq \{A \mid A = f^{-1}(B), B \in \mathcal{H}\}$ is a σ-algebra of subsets of $\Omega$.


In words, Proposition (12.4) says that the collection of pre-images of all the sets belonging to some σ-algebra
on the range of a function is a σ-algebra on the domain of that function.


Let $(\Omega, \mathcal{F}, P)$ be a probability space and $X : \Omega \to \mathbb{R}$ be a random variable. $X$ in turn induces the probability
triple $(\mathbb{R}, \mathcal{B}(\mathbb{R}), P_X)$ on the real line.
Definition 12.5 The σ-algebra generated by the random variable $X$ is defined as
$$\sigma(X) \triangleq \{E \subseteq \Omega \mid E = X^{-1}(B) \text{ for some } B \in \mathcal{B}(\mathbb{R})\}. \qquad (12.2)$$


Proposition (12.4) asserts that $\sigma(X)$ defined above is indeed a σ-algebra on $\Omega$.


Proposition 12.6 $\sigma(X) \subseteq \mathcal{F}$, i.e., the σ-algebra generated by $X$ is a sub-σ-algebra of $\mathcal{F}$.


Figure (12.2) shows a pictorial representation of the σ-algebra generated by $X$. Each Borel set $B$ maps back
to an event $E$. The collection of all such pre-images of Borel sets constitutes the σ-algebra generated by $X$.
Thus, $\sigma(X)$ is a σ-algebra that consists precisely of those events whose occurrence or otherwise is completely
determined by looking at the realised value $X(\omega)$. To get a more concrete idea of this concept, let us look
at the following examples:


Figure 12.2: The collection of the pre-images of all Borel sets is the σ-algebra generated by the random
variable $X$, denoted $\sigma(X)$.


Example 1:- Let $(\Omega, \mathcal{F}, P)$ be a probability space and $A \in \mathcal{F}$ be some event. Consider the indicator random
variable of event $A$, $I_A$. It is easy to see that $\sigma(I_A) = \{\emptyset, A, A^c, \Omega\}$. Also, $\sigma(I_A) \subseteq \mathcal{F}$.
Example 2:- Let $([0, 1], \mathcal{B}([0, 1]), \lambda)$ be the probability space in consideration, and consider a random
variable $X(\omega) = \omega$, $\forall \omega \in \Omega$. It can be seen that $\sigma(X) = \mathcal{F}$.


Remark 12.7 As seen from the above two examples, $\sigma(X)$ could either be “small” (as seen in example 1
above) or as “large” as the σ-algebra $\mathcal{F}$ itself (as seen in example 2 above).
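
On a finite sample space, σ(X) can be enumerated explicitly, which makes the "small" versus "large" contrast easy to see. The sketch below uses a hypothetical six-point sample space (a die roll). Since X takes finitely many values here, the pre-images of Borel sets coincide with the pre-images of subsets of the range of X, and the code relies on that.

```python
from itertools import combinations

def powerset(items):
    """All subsets of a finite collection, as frozensets."""
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def sigma_generated(omega, X):
    """All pre-images X^{-1}(B), with B ranging over subsets of the range of X."""
    values = {X(w) for w in omega}
    return {frozenset(w for w in omega if X(w) in B) for B in powerset(values)}

omega = frozenset(range(1, 7))   # hypothetical sample space: one die roll
A = frozenset({2, 4, 6})         # the event "the roll is even"

indicator = lambda w: 1 if w in A else 0   # I_A, as in Example 1
identity = lambda w: w                     # X(w) = w, in the spirit of Example 2

print(sorted(map(sorted, sigma_generated(omega, indicator))))
print(len(sigma_generated(omega, identity)))  # every subset of omega
```

The indicator generates only four events, while the identity map generates the whole power set of the sample space.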


Now, we introduce the important notion of independence of random variables. 


12.4 Independence of Random Variables 


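Definition 12.8 Random variables $X$ and $Y$ are said to be independent if $\sigma(X)$ and $\sigma(Y)$ are independent
σ-algebras.


In other words, $X$ and $Y$ are independent if, for any two Borel sets $B_1$ and $B_2$ on $\mathbb{R}$, the events
$\{\omega : X(\omega) \in B_1\}$ and $\{\omega : Y(\omega) \in B_2\}$ are independent, i.e.,
$$P(\{\omega : X(\omega) \in B_1\} \cap \{\omega : Y(\omega) \in B_2\}) = P(\{\omega : X(\omega) \in B_1\})\, P(\{\omega : Y(\omega) \in B_2\}), \quad \forall B_1, B_2 \in \mathcal{B}(\mathbb{R}).$$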


The following theorem gives a useful characterization of independence of random variables, in terms of the 
joint CDF being equal to the product of the marginals. 


Theorem 12.9 $X$ and $Y$ are independent if and only if
$$F_{X,Y}(x, y) = F_X(x) F_Y(y), \quad \forall x, y \in \mathbb{R}.$$
Proof: First, we prove the necessary part, which is straightforward. Let $X$ and $Y$ be independent. Consider
$B_1, B_2 \in \mathcal{B}(\mathbb{R})$. Then the events $\{\omega \mid X(\omega) \in B_1\}$ and $\{\omega \mid Y(\omega) \in B_2\}$ are independent (by
Definition 12.8), so $P(X \in B_1, Y \in B_2) = P(X \in B_1) P(Y \in B_2)$, i.e., $P_{X,Y}(B_1 \times B_2) = P(X \in B_1) P(Y \in B_2)$.


This is true for all Borel sets in $\mathbb{R}$. In particular, choosing $B_1 = (-\infty, x]$ and $B_2 = (-\infty, y]$, we get
$F_{X,Y}(x, y) = F_X(x) F_Y(y)$, $\forall x, y \in \mathbb{R}$, which completes the proof of the necessary part.


The sufficiency part is more involved; refer to [1, Section 4.2]. ∎
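
The factorization criterion of Theorem 12.9 can be tested exhaustively on a small example. The sketch below uses a hypothetical experiment with two fair dice, taking X as the first die, Y as the second, and S = X + Y. Because all three variables are integer-valued, checking the CDFs on an integer grid covering their ranges is enough to compare the step functions.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
# Hypothetical probability space: two fair dice, each outcome has mass 1/36.
outcomes = {(a, b): Fraction(1, 36) for a, b in product(faces, faces)}

def cdf(U, u):
    """F_U(u) = P(U <= u)."""
    return sum((p for w, p in outcomes.items() if U(w) <= u), Fraction(0))

def joint_cdf(U, V, u, v):
    """F_{U,V}(u, v) = P(U <= u, V <= v)."""
    return sum((p for w, p in outcomes.items() if U(w) <= u and V(w) <= v),
               Fraction(0))

def factorizes(U, V, grid):
    """Check F_{U,V}(u, v) = F_U(u) F_V(v) at every grid point."""
    return all(joint_cdf(U, V, u, v) == cdf(U, u) * cdf(V, v)
               for u in grid for v in grid)

X = lambda w: w[0]          # first die
Y = lambda w: w[1]          # second die
S = lambda w: w[0] + w[1]   # sum of the two dice

grid = range(0, 13)
print(factorizes(X, Y, grid))  # the two dice are independent
print(factorizes(X, S, grid))  # the first die and the sum are not
```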


Definition 12.10 Random variables $X_1, X_2, \ldots, X_n$ are said to be independent if the σ-algebras $\sigma(X_1), \sigma(X_2), \ldots,$
$\sigma(X_n)$ are independent, i.e., for any $B_i \in \mathcal{B}(\mathbb{R})$, $1 \le i \le n$, we have
$$P(X_1 \in B_1, X_2 \in B_2, \ldots, X_n \in B_n) = \prod_{i=1}^{n} P(X_i \in B_i).$$


Theorem 12.11 $X_1, X_2, \ldots, X_n$ are independent if and only if
$$F_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} F_{X_i}(x_i).$$


Proof: Refer to [1, Section 4.2]. ∎


Finally, we define independence for an arbitrary family of random variables. 


Definition 12.12 An arbitrary family of random variables, $\{X_i, i \in I\}$, is said to be independent if the
σ-algebras $\{\sigma(X_i), i \in I\}$ are independent (Lecture #9, Section 9.2, Definition 9.7).
12.5 Exercises


1. Prove Claim 1 under Theorem 12.1. 


2. For random variables $X$ and $Y$ defined on the same probability space, with joint CDF $F_{X,Y}(x, y)$, prove
that $\lim_{x \to -\infty,\, y \to -\infty} F_{X,Y}(x, y) = 0$.
3. Prove Propositions (12.4) and (12.6). 


4. [Quiz IT 2014] Suppose X and Y are independent random variables, and f(X) and g(Y) are functions 
of X and Y respectively. Will the random variables f(X) and g(Y) be independent? Justify your 
answer. 
Lecture 13: Conditional Distributions and Joint Continuity


Lecturer: Dr. Krishna Jagannathan Scribe: Subrahmanya Swamy P 


13.1 Conditional Probability for Discrete Random Variables 


If $X$ and $Y$ are discrete random variables, then the range of the map $(X(\cdot), Y(\cdot))$ is a countable subset of
$\mathbb{R}^2$. This is because the Cartesian product of two countable sets is countable (Why?). Hence, $(X(\cdot), Y(\cdot))$ is
a discrete random variable on $\mathbb{R}^2$. We will see later that we do not have a similar result when $X$ and $Y$ are
continuous random variables, i.e., if $X$ and $Y$ are marginally continuous random variables, they need not be
jointly continuous.


The joint pmf of discrete random variables $X$ and $Y$ is defined as:
$$p_{X,Y}(x, y) = P(X = x, Y = y), \quad x, y \in \mathbb{R}.$$

The joint pmf uniquely specifies the joint law. In particular, for any $B \in \mathcal{B}(\mathbb{R}^2)$,
$$P_{X,Y}(B) = \sum_{(x,y) \in B} p_{X,Y}(x, y). \qquad (13.1)$$


An example of two discrete random variables is shown in Figure (13.1). 


13.1.1 Conditional pmf 


Now, we define the conditional pmf for discrete random variables. 


Definition 13.1 Let $X$ and $Y$ be discrete random variables defined on $(\Omega, \mathcal{F}, P)$. The conditional probability of
$X$ given $Y$ is defined as:
$$p_{X|Y}(x \mid y) = P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{p_{X,Y}(x, y)}{p_Y(y)}, \quad \text{where } p_Y(y) > 0.$$


The following theorem characterizes independence of discrete random variables in terms of the conditional 
pmf. 


Theorem 13.2 The following statements are equivalent for discrete random variables X and Y: 


(a) $X$, $Y$ are independent.
(b) For all $x, y \in \mathbb{R}$, $\{X = x\}$ and $\{Y = y\}$ are independent.
(c) For all $x, y \in \mathbb{R}$, $p_{X,Y}(x, y) = p_X(x) p_Y(y)$.
(d) For all $x, y \in \mathbb{R}$ such that $p_Y(y) > 0$, $p_{X|Y}(x \mid y) = p_X(x)$.


Figure 13.1: An example of two discrete random variables. The probability measure assigned to any Borel
set $B$ on $\mathbb{R}^2$ can be obtained by summing the joint pmf over the set $B$; see (13.1).


Proof: (b) $\iff$ (c) and (c) $\iff$ (d) follow directly from the definitions. Now, we prove the equivalence of (a)
and (c).


(a) $\implies$ (c):
$X$ and $Y$ are independent $\implies P(X \in B_1, Y \in B_2) = P(X \in B_1) P(Y \in B_2)$. Take $B_1 = \{x\}$ and $B_2 = \{y\}$;
then the result follows.


(c) $\implies$ (a):
$$P(X \in B_1, Y \in B_2) = \sum_{x \in B_1,\, y \in B_2} p_{X,Y}(x, y) = \sum_{x \in B_1} \sum_{y \in B_2} p_X(x) p_Y(y) = \sum_{x \in B_1} p_X(x) \sum_{y \in B_2} p_Y(y) = P(X \in B_1) P(Y \in B_2).$$


13.2 Jointly Continuous Distributions 


Definition 13.3 Two random variables $X$ and $Y$ are said to be jointly continuous if the joint probability
law $P_{X,Y}$ is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^2$. That is, for every Borel set
$N \subseteq \mathbb{R}^2$ of Lebesgue measure zero, we have $P(\{(X, Y) \in N\}) = 0$.


The Radon-Nikodym theorem for this situation would assert the following 


Theorem 13.4 $X$, $Y$ are jointly continuous random variables if and only if there exists a measurable function
$f_{X,Y} : \mathbb{R}^2 \to [0, \infty)$ such that for any Borel set $B$ on $\mathbb{R}^2$,
$$P_{X,Y}(B) = \int_B f_{X,Y}\, d\lambda,$$
where $\lambda$ is Lebesgue measure on $\mathbb{R}^2$.
In particular, taking $B = (-\infty, x] \times (-\infty, y]$, we have
$$F_{X,Y}(x, y) = P(\{X \le x, Y \le y\}) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(u, v)\, dv\, du, \qquad (13.2)$$

where $F_{X,Y}(x, y)$ and $f_{X,Y}(x, y)$ are the joint cdf and joint pdf respectively. The joint pdf is thus a complete
characterization of the joint law, for jointly continuous random variables.


Figure 13.2: Y = 2X


Caution: If $X$ is continuous and $Y$ is continuous, $(X, Y)$ need not be jointly continuous. This can be seen
from the following example.


Example: Let $X \sim \mathcal{N}(0, 1)$ and $Y = 2X$, i.e., $Y \sim \mathcal{N}(0, 4)$. In this case, though $X$ is continuous and $Y$ is
continuous, $(X, Y)$ are not jointly continuous.

This can be understood from Figure 13.2. Each $\omega \in \Omega$ is mapped onto the straight line $Y = 2X$ on
$\mathbb{R}^2$. The Lebesgue measure of the line (set) $L = \{(x, y) \in \mathbb{R}^2 : y = 2x\}$ is zero, but the corresponding
probability $P_{X,Y}(L) = 1$, since every $\omega \in \Omega$ is mapped to this straight line. Thus, from the definition of
jointly continuous random variables, $X$ and $Y$ are not jointly continuous.
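
A short simulation makes the point tangible: every realization of (X, Y) lies exactly on the line y = 2x, a set of zero area, even though each coordinate on its own behaves like a Gaussian. The sample size and seed below are arbitrary choices for this sketch.

```python
import random
import statistics

rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(10000)]   # X ~ N(0, 1)
ys = [2.0 * x for x in xs]                          # Y = 2X

# Fraction of sampled points lying on the line L = {(x, y) : y = 2x}.
on_line = sum(y == 2.0 * x for x, y in zip(xs, ys)) / len(xs)

# Marginally, Y looks like N(0, 4): its sample variance is close to 4.
var_y = statistics.pvariance(ys)

print(on_line, var_y)
```

All the probability mass sits on L, so no density with respect to Lebesgue measure on the plane can describe the joint law.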




On the other hand, if X and Y are jointly continuous, their marginals are necessarily continuous. To see 
this, note that 


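$$
P(\{X \le x, Y \le y\}) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(u,v)\, dv\, du
$$

$$
\Rightarrow\; P(X \le x) = \int_{-\infty}^{x} \left( \int_{-\infty}^{\infty} f_{X,Y}(u,v)\, dv \right) du = \int_{-\infty}^{x} f_X(u)\, du. \tag{13.3}
$$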


In (13.3), it is clear that the inner integral in the parentheses produces a non-negative measurable function 
of u. Thus, (13.3) asserts that the marginal CDF of X can be written as the integral of a non-negative 
measurable function, which can be identified as the marginal pdf fx. Thus, X is continuous and fx given 
by the inner integral in (13.3) is the pdf of X. A similar argument holds for the marginal pdf of Y. 


13.3 Independence of Jointly Continuous Random Variables


Two random variables X and Y are said to be independent if and only if

$$
F_{X,Y}(x,y) = F_X(x)\,F_Y(y) \quad \forall x, y \in \mathbb{R}.
$$


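Applying this definition, for the particular case of jointly continuous random variables,

$$
\int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(u,v)\, dv\, du = \int_{-\infty}^{x} f_X(u)\, du \int_{-\infty}^{y} f_Y(v)\, dv = \int_{-\infty}^{x} \int_{-\infty}^{y} f_X(u)\, f_Y(v)\, dv\, du.
$$

Since the above equality holds for all $x, y \in \mathbb{R}$, the integrands must be equal almost everywhere, i.e.,

$$
f_{X,Y}(x,y) = f_X(x)\, f_Y(y) \quad \forall x, y \in \mathbb{R},
$$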


except possibly on a subset of R? of Lebesgue measure zero. Indeed, the above condition can be seen to be 
both necessary and sufficient for the independence of two jointly continuous random variables. 


13.4 Conditional pdf for jointly continuous random variables 


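We would like to define the conditional cdf $F_{X|Y}(x|y) = P(X \le x \mid Y = y)$. But the event $\{Y = y\}$ has zero
probability for every y when Y is continuous! To overcome this technical difficulty, we proceed by conditioning on
Y taking values in a small interval $(y, y + \epsilon)$, and then take the limit $\epsilon \downarrow 0$. More concretely, let us consider
the following derivation, which motivates the definition of the conditional pdf for the jointly continuous case.

Informal Motivation
We can approximately define the conditional CDF of X, given that Y takes a value "close to y", as

$$
\begin{aligned}
F_{X|Y}(x|y) &\approx P(X \le x \mid y \le Y \le y + \epsilon) \quad (\text{for small } \epsilon) \\
&= \frac{P(\{X \le x\} \cap \{y \le Y \le y + \epsilon\})}{P(\{y \le Y \le y + \epsilon\})} \\
&= \frac{F_{X,Y}(x, y + \epsilon) - F_{X,Y}(x, y)}{F_Y(y + \epsilon) - F_Y(y)} \\
&= \frac{\left[ F_{X,Y}(x, y + \epsilon) - F_{X,Y}(x, y) \right] / \epsilon}{\left[ F_Y(y + \epsilon) - F_Y(y) \right] / \epsilon}.
\end{aligned}
$$

As $\epsilon \to 0$, the numerator of the RHS looks like the partial derivative of $F_{X,Y}$ w.r.t. y, and the denominator looks like $f_Y(y)$:

$$
\frac{\partial F_{X,Y}(x,y) / \partial y}{f_Y(y)} = \frac{\int_{-\infty}^{x} f_{X,Y}(u, y)\, du}{f_Y(y)}.
$$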


This motivates the following definition for the conditional cdf and conditional pdf. 


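Definition 13.5 a) The conditional cdf of X given Y is defined as follows:

$$
F_{X|Y}(x|y) = \int_{-\infty}^{x} \frac{f_{X,Y}(u,y)}{f_Y(y)}\, du \quad \text{for any } y \text{ such that } f_Y(y) > 0.
$$

b) The conditional pdf of X given Y is defined as follows:

$$
f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} \quad \text{for any } y \text{ such that } f_Y(y) > 0.
$$

c) The conditional probability of an event $A \in \mathcal{B}(\mathbb{R})$ given $Y = y$ is defined by

$$
P(X \in A \mid Y = y) = \int_A f_{X|Y}(v|y)\, dv = \int_{-\infty}^{\infty} 1_A(v)\, f_{X|Y}(v|y)\, dv,
$$

where $1_A(x)$ is the indicator function for the event $\{x \in A\}$.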


Example: 
Let X, Y be jointly continuous with $f_{X,Y}(x,y) = 1$ in the region shown in Figure 13.3. Find all the marginals
and conditional distributions.

One can easily verify that $f_{X,Y}(x,y)$ is a valid joint pdf by integrating it over the region, i.e.,

$$
\int_0^2 \int_0^1 f_{X,Y}(x,y)\, dx\, dy = 1.
$$


The marginal pdf of Y can be calculated as follows. In Figure 13.3, the equation of the line is $y = -2x + 2$.
So, for a given y, $f_{X,Y}(x,y)$ is nonzero only in the range $x \in (0, 1 - \frac{y}{2})$. This can be seen from Figure
13.3.

Thus,

$$
f_Y(y) = \int_0^{1 - y/2} 1\, dx = 1 - \frac{y}{2}, \quad 0 \le y \le 2.
$$

Figure 13.3: The region on which $f_{X,Y}(x,y) = 1$: the triangle with vertices (0,0), (1,0) and (0,2).


Similarly, the marginal pdf of X can be calculated as follows. From the line equation, $y = -2x + 2$, we can
find that for a given x, $f_{X,Y}(x,y)$ is nonzero only in the range $y \in (0, 2 - 2x)$. Thus,

$$
f_X(x) = \int_0^{2 - 2x} 1\, dy = 2 - 2x, \quad 0 \le x \le 1.
$$


The marginals of Y and X have been plotted in Figure 13.4. 


Now, let us find the conditional pdf $f_{X|Y}(x|y)$. Here, we are computing a function of x for a given (fixed) y.

$$
f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} = \frac{1}{1 - \frac{y}{2}} = \frac{2}{2 - y}, \quad x \in \left(0, 1 - \frac{y}{2}\right).
$$

Though x does not appear in the expression, it appears in the constraint. So this is a conditionally uniform
distribution in x, i.e., given $\{Y = y\}$, X is uniformly distributed on $(0, 1 - \frac{y}{2})$, as $f_{X|Y}(x|y)$ is constant
w.r.t. x for the specified range.
Similarly, we can find

$$
f_{Y|X}(y|x) = \frac{1}{2 - 2x}, \quad y \in (0, 2 - 2x).
$$

Figure 13.4: Marginal pdf of X and Y
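As a numerical sanity check of the example above, the following Python sketch approximates the marginals and the conditional pdf with a midpoint Riemann sum. It assumes the region of Figure 13.3 is the triangle with vertices (0,0), (1,0) and (0,2); the function names and the grid size `n` are illustrative choices, not part of the lecture.

```python
def f_xy(x, y):
    """Joint pdf: 1 inside the triangle bounded by the axes and y = 2 - 2x, else 0."""
    return 1.0 if (x >= 0.0 and y >= 0.0 and y <= 2.0 - 2.0 * x) else 0.0


def marginal_y(y, n=20000):
    """Midpoint-rule approximation of the integral of f_xy(x, y) over x in [0, 1]."""
    h = 1.0 / n
    return sum(f_xy((i + 0.5) * h, y) for i in range(n)) * h


def marginal_x(x, n=20000):
    """Midpoint-rule approximation of the integral of f_xy(x, y) over y in [0, 2]."""
    h = 2.0 / n
    return sum(f_xy(x, (j + 0.5) * h) for j in range(n)) * h


def conditional_x_given_y(x, y):
    """f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y), defined only where f_Y(y) > 0."""
    f_y = marginal_y(y)
    if f_y <= 0.0:
        raise ValueError("conditional pdf is undefined where f_Y(y) = 0")
    return f_xy(x, y) / f_y


if __name__ == "__main__":
    print(marginal_y(1.0))                   # compare with 1 - y/2
    print(marginal_x(0.25))                  # compare with 2 - 2x
    print(conditional_x_given_y(0.2, 1.0))   # compare with 2 / (2 - y)
```

The printed values can be compared against the closed forms $1 - y/2$, $2 - 2x$ and $2/(2-y)$ derived in the example.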


13.5 Exercises 


1. Two persons X and Y live in cities A and B but work in cities B and A respectively. Every morning 
they start for work at a uniformly random time between 9 am and 10 am independent of each other. 
Both of them travel at the same constant speed and it takes 20 minutes to reach the other city. What 
is the probability that X and Y meet each other on their way to work ? 


2. Data is taken on the height and shoe size of a sample of MIT students. Height (X) is coded by 3 
values: 1 (short), 2 (average), 3 (tall) and Shoe size (Y) is coded by 3 values 1 (small), 2 (average), 3 
(large). The joint counts are given in the following table: 


      | X=1 | X=2 | X=3
Y=1   | 234 | 225 |  84
Y=2   | 180 | 453 | 161
Y=3   |  39 | 192 | 157


(a) Find the joint and marginal pmf of X and Y. 
(b) Are X and Y independent? Discuss in detail. 


3. John is vacationing in Monte Carlo. Each evening, the amount of money he takes to the casino is a
random variable X with the pdf

$$
f_X(x) = \begin{cases} Cx, & 0 \le x \le 100, \\ 0, & \text{elsewhere.} \end{cases}
$$

At the end of each night, the amount Y he returns with is uniformly distributed between zero and
twice the amount he came to the casino with.
(a) Find the value of C.
(b) For a fixed a, $0 \le a \le 100$, what is the conditional pdf of Y given X = a?


(c) If John goes to the casino with a dollars, what is the probability he returns with more than a 
dollars ? 


(d) Determine the joint pdf, fx,y (x,y), of X and Y as well as the marginal pdf, fy (y), of Y. 


4. A rod is broken at two points that are chosen uniformly and independently at random. What is the
probability that the three resulting pieces form a triangle?


5. [https : //engineering.purdue.edu/ ipollak /ece302/.../problems/problems_A] Melvin Fooch, a student
of probability theory, has found that the hours he spends working (W) and sleeping (S) in preparation
for a final exam are random variables described by:


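$$
f_{W,S}(w, s) = \begin{cases} K, & 10 \le w + s \le 20 \text{ and } w \ge 0,\ s \ge 0, \\ 0, & \text{elsewhere,} \end{cases}
$$

where K is a constant.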


What poor Melvin does not know, and even his best friends will not tell him, is that working only
furthers his confusion and that his grade, G, can be described by $G = 2.5(S - W) + 50$.


(a) The instructor has decided to pass Melvin if, on the exam, he achieves $G \ge 75$. What is the
probability that this will occur?


(b) Suppose Melvin got a grade greater than or equal to 75 on the exam. Determine the conditional 
probability that he spent less than one hour working in preparation for this exam. 


(c) Are the random variables W and S independent? Justify. 


6. [https : //engineering.purdue.edu/ ipollak/ece302/.../problems/problems_4] Stations A and B are
connected by two parallel message channels. A message from A to B is sent over both channels at
the same time. Continuous random variables X and Y represent the message delays (in hours) over 
parallel channels I and II, respectively. These two random variables are independent, and both are 
uniformly distributed from 0 to 1 hours. A message is considered received as soon as it arrives on any 
one channel, and it is considered verified as soon as it has arrived over both channels. 


(a) Determine the probability that a message is received within 15 minutes after it is sent. 

(b) Determine the probability that the message is received but not verified within 15 minutes after it 
is sent. 

(c) If the attendant at B goes home 15 minutes after the message is received, what is the probability 
that he is present when the message should be verified? 


7. [https : //engineering.purdue.edu/ ipollak/ece302/.../problems/problems_A] Random variables B
and C are jointly uniform over a $2l \times 2l$ square centered at the origin, i.e., B and C have the following
joint probability density function:

$$
f_{B,C}(b,c) = \begin{cases} \dfrac{1}{4l^2}, & -l \le b \le l,\ -l \le c \le l, \\ 0, & \text{elsewhere.} \end{cases}
$$

It is given that $l \ge 1$. Find the probability that the quadratic equation $x^2 + 2Bx + C = 0$ has real
roots (the answer will be an expression involving l). What is the limit of this probability as $l \to \infty$?


8. (a) [Dimitri P. Bertsekas] Consider four independent rolls of a 6-sided die. Let X be the number of
1's and let Y be the number of 2's obtained. What is the joint PMF of X and Y?

(b) [MIT OCW problem set] Let $X_1, X_2, X_3$ be independent random variables, uniformly distributed
on [0,1]. Let Y be the median of $X_1, X_2, X_3$ (that is, the middle of the three values). Find the
conditional CDF of $X_1$, given that Y = 0.5. Under this conditional distribution, is $X_1$ continuous?
Discrete?


Lecture 14: Introduction to Transformation of Random Variables
Lecturer: Dr. Krishna Jagannathan Scribe: Arjun Nadh


Suppose we are able to observe a random variable or a collection of random variables. In many practical 
situations, we may be more interested in some function of the observed random variable(s). For example, in 
communication systems, the logarithm of the noise power is often more useful to an engineer than the noise 
realisation itself. 


Let X be a random variable on $(\Omega, \mathcal{F}, P)$ and $f : \mathbb{R} \to \mathbb{R}$ be a function. We are interested in characterising
the properties of $f(X)$. Since the random variable X is itself a function, $f(X)$ is a composed function that maps
$\Omega$ to $\mathbb{R}$. First, we have to ask if $f(X)$ is indeed a legitimate random variable. Consider the composed function
$f \circ X(\cdot)$, depicted in Figure 14.1. If f is an arbitrary function, $f(X)$ may not be a random variable. However, if
$f : \mathbb{R} \to \mathbb{R}$ is a Borel-measurable function (i.e., pre-images of Borel sets under f are also Borel sets), then
it is clear that the pre-images of Borel sets under the composed function $f \circ X(\cdot)$ are events (why?), and it
follows that $f(X)$ is indeed a random variable. Similarly, for a Borel-measurable function $f : \mathbb{R}^n \to \mathbb{R}$, and
random variables $X_1, X_2, \dots, X_n$, it can be argued that $f(X_1, \dots, X_n)$ is a random variable.


Figure 14.1: Transformation of random variable 


Now that we have established conditions under which a function of a random variable is a random variable, 
we ask after the probability law of f(X), given the probability law Px of X. Equivalently, given the CDF of 
X, we want to find the CDF of f(X). We begin by considering some elementary functions such as maximum, 
minimum, and summations, and then proceed to more general transformations. 


14.1 Maximum and Minimum 


Let $X_1, X_2, \dots, X_n$ be random variables on $(\Omega, \mathcal{F}, P)$ with joint CDF $F_{X_1, X_2, \dots, X_n}$. Define

$$
Y_n = \min(X_1, X_2, \dots, X_n)
$$

and

$$
Z_n = \max(X_1, X_2, \dots, X_n).
$$

Here we are interested in finding the CDFs of $Y_n$ and $Z_n$.


First let us check that $Z_n$ is indeed a random variable. Note that $\{Z_n \le x\}$ is equivalent to saying that each
of $X_1, X_2, \dots, X_n$ is less than or equal to x. Thus, we have

$$
\{Z_n \le x\} = \{X_1 \le x, X_2 \le x, \dots, X_n \le x\}.
$$

Now, in order to see that $\{\omega : Z_n(\omega) \le x\}$ is an event, note that

$$
\{\omega : Z_n(\omega) \le x\} = \bigcap_{i=1}^{n} \{\omega : X_i(\omega) \le x\}.
$$

This is a finite intersection of events, since the $X_i$'s are random variables. Therefore, $Z_n$ is a legitimate random
variable.

Next, for the minimum, note that $\{Y_n > x\}$ is equivalent to saying that each of $X_1, X_2, \dots, X_n$ is
greater than x. Thus,

$$
\{Y_n > x\} = \{X_1 > x, X_2 > x, \dots, X_n > x\}.
$$

We can prove that $Y_n$ is also a random variable, by using arguments similar to those used for proving that
$Z_n$ is a random variable.


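We now proceed to compute the CDFs of the random variables $Z_n$ and $Y_n$.

$$
\begin{aligned}
F_{Z_n}(x) = P(\{Z_n \le x\}) &= P(\{X_1 \le x\} \cap \{X_2 \le x\} \cap \dots \cap \{X_n \le x\}) \\
&= F_{X_1, X_2, \dots, X_n}(x, x, \dots, x).
\end{aligned}
$$

Similarly for $Y_n$,

$$
\begin{aligned}
\bar{F}_{Y_n}(x) = 1 - F_{Y_n}(x) = P(\{Y_n > x\}) &= P(\{X_1 > x\} \cap \{X_2 > x\} \cap \dots \cap \{X_n > x\}) \\
&= \bar{F}_{X_1, X_2, \dots, X_n}(x, x, \dots, x),
\end{aligned}
$$

where $\bar{F}_{X_1, X_2, \dots, X_n}(\cdot)$ denotes the joint complementary CDF.

In particular, if $X_1, X_2, \dots, X_n$ are independent,

$$
F_{Z_n}(x) = F_{X_1}(x)\, F_{X_2}(x) \cdots F_{X_n}(x),
$$
$$
\bar{F}_{Y_n}(x) = \bar{F}_{X_1}(x)\, \bar{F}_{X_2}(x) \cdots \bar{F}_{X_n}(x).
$$

Further, if they are i.i.d. (independent and identically distributed), then

$$
F_{Z_n}(x) = [F_X(x)]^n,
$$
$$
\bar{F}_{Y_n}(x) = [\bar{F}_X(x)]^n.
$$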
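The product formulas for independent random variables translate directly into code. The sketch below is illustrative only: `cdf_max`, `cdf_min` and `unif_cdf` are hypothetical helper names, and each marginal CDF is assumed to be supplied as a Python callable.

```python
from math import prod


def cdf_max(cdfs, x):
    """CDF of max(X_1, ..., X_n) at x for independent X_i: product of the marginal CDFs."""
    return prod(F(x) for F in cdfs)


def cdf_min(cdfs, x):
    """CDF of min(X_1, ..., X_n) at x: one minus the product of the complementary CDFs."""
    return 1.0 - prod(1.0 - F(x) for F in cdfs)


def unif_cdf(x):
    """CDF of a Unif[0, 1] random variable."""
    return min(max(x, 0.0), 1.0)


if __name__ == "__main__":
    two_uniforms = [unif_cdf, unif_cdf]
    print(cdf_max(two_uniforms, 0.5))  # [F(x)]^2 at x = 0.5
    print(cdf_min(two_uniforms, 0.5))  # 1 - [1 - F(x)]^2 at x = 0.5
```

Passing callables keeps the helpers agnostic to the particular distributions, so non-identical marginals can be mixed in the same list.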


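Example 1: Consider $U_1, U_2$ to be i.i.d. Unif[0, 1].

Let $Y = \min(U_1, U_2)$ and $Z = \max(U_1, U_2)$.

Let $F_{U_1}(z)$ and $F_{U_2}(z)$ be the CDFs of the random variables $U_1$ and $U_2$ respectively. Since they are independent and identically
distributed,

$$
F_Z(z) = F_{U_1}(z)\, F_{U_2}(z) = [F_U(z)]^2,
$$

where

$$
F_U(z) = \begin{cases} 0, & z < 0, \\ z, & z \in [0,1], \\ 1, & z > 1, \end{cases}
\qquad
[F_U(z)]^2 = \begin{cases} 0, & z < 0, \\ z^2, & z \in [0,1], \\ 1, & z > 1. \end{cases}
$$

Its pdf is given by

$$
f_Z(z) = \begin{cases} 2z, & z \in [0,1], \\ 0, & \text{otherwise.} \end{cases}
$$

Similarly, for Y we can write

$$
\bar{F}_Y(y) = \bar{F}_{U_1}(y)\, \bar{F}_{U_2}(y) = [\bar{F}_U(y)]^2,
$$

where $\bar{F}_Y(y)$ denotes the complementary CDF of Y. Hence,

$$
F_Y(y) = \begin{cases} 0, & y < 0, \\ 1 - (1 - y)^2, & y \in [0,1], \\ 1, & y > 1. \end{cases}
$$

The pdf is given by

$$
f_Y(y) = \begin{cases} 2(1 - y), & y \in [0,1], \\ 0, & \text{otherwise.} \end{cases}
$$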


Example 2:- 


Let $X_1, X_2, \dots, X_n$ be independent random variables which are exponentially distributed with
parameters $\lambda_1, \lambda_2, \dots, \lambda_n > 0$, i.e., $F_{X_i}(x) = 1 - e^{-\lambda_i x}$ for $x \ge 0$.

Let

$$
Y_n = \min(X_1, X_2, \dots, X_n).
$$

Then the complementary CDF of $Y_n$ is

$$
\bar{F}_{Y_n}(x) = \prod_{i=1}^{n} \bar{F}_{X_i}(x) = \prod_{i=1}^{n} e^{-\lambda_i x} = e^{-(\lambda_1 + \lambda_2 + \dots + \lambda_n) x}, \quad x \ge 0.
$$

We can see that $Y_n$ is an exponential random variable with parameter $\lambda_1 + \lambda_2 + \dots + \lambda_n$. Thus, the
minimum of independent exponential random variables is another exponential random variable!
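A small simulation can make this result tangible. The sketch below is an illustration under assumed values: the rates, sample size and seed are arbitrary choices, and only the standard library `random.Random.expovariate` sampler is used. It compares the empirical mean of the minimum with the predicted mean, which is the reciprocal of the sum of the rates.

```python
import random


def simulate_min_exponentials(rates, n_samples=100_000, seed=0):
    """Draw n_samples realisations of min(X_1, ..., X_n), with X_i ~ exp(rates[i]) independent."""
    rng = random.Random(seed)
    return [min(rng.expovariate(lam) for lam in rates) for _ in range(n_samples)]


# Illustrative rates (assumed values, not from the lecture).
rates = [0.5, 1.5, 2.0]
samples = simulate_min_exponentials(rates)

empirical_mean = sum(samples) / len(samples)
predicted_mean = 1.0 / sum(rates)  # mean of an exponential with parameter sum(rates)

if __name__ == "__main__":
    print(empirical_mean, predicted_mean)
```

Agreement of the means is only a necessary check; comparing the empirical survival function with $e^{-(\lambda_1 + \dots + \lambda_n) x}$ at a few points would be a sharper test.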


14.2 Exercises 


1. Light bulbs with Amnesia: Suppose that n light bulbs in a room are switched on at the same instant.
The lifetime of each bulb is exponentially distributed with parameter $\mu = 1$, and the lifetimes are independent.


(a) Starting from the time they are switched on, find the distribution of the time when the first bulb 
fuses out. 


(b) Find the CDF and the density of the time when the room goes completely dark. 
(c) Would your answers to the above parts change if the bulbs were not switched on at the same 


time, but instead, turned on at arbitrary times? Assume however that all bulbs were turned on 
before the first one fused out. 


(d) Suppose you walk into the room and find m bulbs glowing. Starting from the instant of your 
walking in, what is the distribution of the time it takes until you see a bulb blow out? 


2. Let X and Y be independent exponentially distributed random variables with parameters $\lambda$ and $\mu$
respectively.

(a) Show that $Z = \min(X, Y)$ is independent of the event $\{X < Y\}$, and interpret this result
verbally. [Definition: A random variable X is said to be independent of an event A if X and $I_A$
are independent random variables, where $I_A$ is the indicator random variable of the event A.]

(b) Find $P(X = Z)$.


Lecture 15: Sums of Random Variables

Lecturer: Dr. Krishna Jagannathan Scribes: R.Ravi Kiran


15.1 Sum of Two Random Variables 


In this section, we will study the distribution of the sum of two random variables. Before we discuss their 
distributions, we will first need to establish that the sum of two random variables is indeed a random variable. 


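Theorem 15.1 Let X and Y be random variables defined on a probability space $(\Omega, \mathcal{F}, P)$ and define $Z(\omega) =
X(\omega) + Y(\omega)$, $\forall \omega \in \Omega$. Then, Z is a random variable.
Proof: To prove that Z is a random variable, we need to show that $\{\omega \in \Omega : Z(\omega) > z\} \in \mathcal{F}$, $\forall z \in \mathbb{R}$.

Now, for every $z \in \mathbb{R}$, $Z(\omega) > z$ if and only if there exists a rational q such that $X(\omega) > q$ and $Y(\omega) > z - q$. This
follows from the fact that the set of rationals is dense in $\mathbb{R}$. Thus,

$$
\begin{aligned}
\{\omega \in \Omega : Z(\omega) > z\} &= \bigcup_{q \in \mathbb{Q}} \{\omega \in \Omega : X(\omega) > q,\ Y(\omega) > z - q\} \\
&= \bigcup_{q \in \mathbb{Q}} \left( \{\omega \in \Omega : X(\omega) > q\} \cap \{\omega \in \Omega : Y(\omega) > z - q\} \right). \qquad (15.1)
\end{aligned}
$$

We know that $\forall q \in \mathbb{Q}$, $\{\omega \in \Omega : X(\omega) > q\} \cap \{\omega \in \Omega : Y(\omega) > z - q\} \in \mathcal{F}$ because X and Y are random
variables. Since the set of rationals is countable, we have a countable union of sets from $\mathcal{F}$, which must
also be in $\mathcal{F}$ as it is a $\sigma$-algebra. Thus, $\{\omega \in \Omega : Z(\omega) > z\} \in \mathcal{F}$, proving that the sum $Z = X + Y$ is a
random variable. $\blacksquare$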


We will now start with random variables in the discrete domain. Assume that X and Y are discrete random
variables with a known joint pmf $p_{X,Y}(\cdot)$. Let the random variable Z be defined as $Z = X + Y$. We will now
characterize the pmf of Z, $p_Z(\cdot)$:

$$
\begin{aligned}
p_Z(z) &= P(Z = z) \\
&= \sum_{(x,y)\,:\,x + y = z} p_{X,Y}(x, y) \\
&= \sum_{x} P(X = x, Y = z - x) \qquad (15.2) \\
&= \sum_{x} p_{X,Y}(x, z - x).
\end{aligned}
$$

In particular, if X and Y are independent, the pmf of Z simplifies to

$$
p_Z(z) = \sum_{x} p_X(x)\, p_Y(z - x), \tag{15.3}
$$

which is simply the discrete convolution of the two pmfs.
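The discrete convolution can be sketched in a few lines of Python. Here each pmf is assumed to be stored as a dictionary mapping values to probabilities, and `convolve_pmfs` is an illustrative name rather than a library routine; the fair-die example is likewise an assumption made for the demonstration.

```python
def convolve_pmfs(p_x, p_y):
    """Return the pmf of Z = X + Y for independent X and Y.

    Each pmf is a dict {value: probability}. For every pair (x, y), the product
    p_x[x] * p_y[y] is accumulated into the entry for z = x + y.
    """
    p_z = {}
    for x, px in p_x.items():
        for y, py in p_y.items():
            p_z[x + y] = p_z.get(x + y, 0.0) + px * py
    return p_z


# Illustrative use: the sum of two independent fair six-sided dice.
die = {k: 1.0 / 6.0 for k in range(1, 7)}
two_dice = convolve_pmfs(die, die)

if __name__ == "__main__":
    for total in sorted(two_dice):
        print(total, two_dice[total])
```

Looping over pairs $(x, y)$ is equivalent to summing $p_X(x)\,p_Y(z - x)$ over x for each z, but avoids having to enumerate the support of Z in advance.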


Let us now look at an example.

Example 15.2 Let X and Y be independent random variables with distributions given by Pois($\lambda$) and
Pois($\mu$) respectively. Define $Z = X + Y$. Then, the pmf of Z can be computed by invoking (15.3):

$$
\begin{aligned}
p_Z(z) &= \sum_{x=0}^{z} \frac{e^{-\lambda} \lambda^{x}}{x!} \cdot \frac{e^{-\mu} \mu^{z - x}}{(z - x)!} \\
&= \frac{e^{-(\lambda + \mu)}}{z!} \sum_{x=0}^{z} \binom{z}{x} \lambda^{x} \mu^{z - x} \\
&= \frac{e^{-(\lambda + \mu)} (\lambda + \mu)^{z}}{z!}.
\end{aligned}
$$

The above computation establishes that the sum of two independent Poisson distributed random variables,
with mean values $\lambda$ and $\mu$, also has a Poisson distribution, with mean $\lambda + \mu$.


We can easily extend the same derivation to the case of a finite sum of independent Poisson distributed
random variables.
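The Poisson result can be checked numerically by evaluating the convolution sum directly and comparing it with the Poisson pmf of the summed mean. The sketch below uses illustrative names (`poisson_pmf`, `pmf_of_sum`) and assumed example means of 1.5 and 2.5.

```python
from math import exp, factorial


def poisson_pmf(mean, k):
    """P(K = k) for a Poisson random variable with the given mean."""
    return exp(-mean) * mean**k / factorial(k)


def pmf_of_sum(lam, mu, z):
    """Convolution sum: P(X + Y = z) for independent X ~ Pois(lam), Y ~ Pois(mu)."""
    return sum(poisson_pmf(lam, x) * poisson_pmf(mu, z - x) for x in range(z + 1))


if __name__ == "__main__":
    lam, mu = 1.5, 2.5  # assumed example values
    for z in range(6):
        # The two columns should agree up to floating-point rounding.
        print(z, pmf_of_sum(lam, mu, z), poisson_pmf(lam + mu, z))
```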


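Next, we consider the case of two jointly continuous random variables. Assume that X and Y are jointly
continuous random variables, with joint pdf given by $f_{X,Y}(x,y)$. Let $Z = X + Y$. Then,

$$
\begin{aligned}
F_Z(z) &= P(Z \le z) \\
&= P(X + Y \le z) \\
&= \int_{-\infty}^{\infty} \left( \int_{-\infty}^{z - x} f_{X,Y}(x, y)\, dy \right) dx \qquad (15.4) \\
&= \int_{-\infty}^{\infty} \left( \int_{-\infty}^{z} f_{X,Y}(x, t - x)\, dt \right) dx \\
&= \int_{-\infty}^{z} \underbrace{\left( \int_{-\infty}^{\infty} f_{X,Y}(x, t - x)\, dx \right)}_{f_Z(t)} dt. \qquad (15.5)
\end{aligned}
$$

Here the fourth line uses the substitution $y = t - x$. From (15.5), we can see that the pdf of Z is given by $f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z - x)\, dx$.

In the special case of X and Y being independent continuous random variables, we get

$$
f_Z(z) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx = f_X * f_Y, \tag{15.6}
$$

which is the convolution of the two marginal pdfs.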


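Example 15.3 Assume that $X_1$ and $X_2$ are independent exponential random variables with parameters $\mu_1$
and $\mu_2$ respectively. Let $Z = X_1 + X_2$. Using (15.6) and the fact that the support of the exponential random
variable is $\mathbb{R}^+ \cup \{0\}$, we get, for $z \ge 0$,

$$
\begin{aligned}
f_Z(z) &= f_{X_1} * f_{X_2} \\
&= \int_0^z \mu_1 e^{-\mu_1 x}\, \mu_2 e^{-\mu_2 (z - x)}\, dx \\
&= \mu_1 \mu_2\, e^{-\mu_2 z} \int_0^z e^{(\mu_2 - \mu_1) x}\, dx.
\end{aligned}
$$

We can see from the above integral that

$$
f_Z(z) = \begin{cases} \dfrac{\mu_1 \mu_2}{\mu_2 - \mu_1} \left( e^{-\mu_1 z} - e^{-\mu_2 z} \right), & \text{if } \mu_1 \ne \mu_2, \\ \mu^2 z\, e^{-\mu z}, & \text{if } \mu_1 = \mu_2 = \mu. \end{cases}
$$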
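The closed form for the density of a sum of two independent exponentials can be compared against a direct numerical evaluation of the convolution integral. The following sketch uses a midpoint rule; the function names, the grid size `n` and the example rates are assumptions made for illustration.

```python
from math import exp


def exp_pdf(rate, x):
    """pdf of an exponential random variable with the given rate parameter."""
    return rate * exp(-rate * x) if x >= 0.0 else 0.0


def convolve_at(z, rate1, rate2, n=20000):
    """Midpoint-rule approximation of the integral over [0, z] of f_X1(x) f_X2(z - x) dx."""
    if z <= 0.0:
        return 0.0
    h = z / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        total += exp_pdf(rate1, x) * exp_pdf(rate2, z - x)
    return total * h


def closed_form(z, rate1, rate2):
    """Closed-form density of X1 + X2, covering both the unequal- and equal-rate cases."""
    if z < 0.0:
        return 0.0
    if rate1 == rate2:
        return rate1**2 * z * exp(-rate1 * z)
    return rate1 * rate2 / (rate2 - rate1) * (exp(-rate1 * z) - exp(-rate2 * z))


if __name__ == "__main__":
    print(convolve_at(1.3, 1.0, 2.5), closed_form(1.3, 1.0, 2.5))
    print(convolve_at(0.7, 2.0, 2.0), closed_form(0.7, 2.0, 2.0))
```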


In fact, the process can be extended to the case of a sum of a finite number n of independent random variables with
distribution exp($\mu$), and we can observe that the pdf of the sum, $Z_n$, is given by Erlang$(n, \mu)$, i.e.,

$$
f_{Z_n}(z) = \frac{\mu^n z^{n-1} e^{-\mu z}}{(n-1)!}, \quad z \ge 0. \tag{15.7}
$$


The above example describes the process of computing the pdf of a sum of continuous random variables. 


The methods described above can be easily extended to deal with finite sums of random variables too. 


15.2 Sum of a random number of random variables 


In this section, we consider a sum of independent random variables, where the number of terms in the
summation is itself random. Let N be a positive integer valued random variable on $(\Omega, \mathcal{F}, P)$ with known pmf
$P(N = n)$. Let $X_1, X_2, \dots$ be independent random variables on the same probability space, $(\Omega, \mathcal{F}, P)$, with
distributions $F_{X_1}(\cdot), F_{X_2}(\cdot), \dots$ respectively. Further, we will assume that N is independent of $\{X_i, i \ge 1\}$.

Define $S_N = \sum_{i=1}^{N} X_i$. That is, $S_N(\omega) = \sum_{i=1}^{N(\omega)} X_i(\omega)$, $\forall \omega \in \Omega$. The cdf of $S_N$ can be computed as follows:

$$
\begin{aligned}
F_{S_N}(x) &= P(S_N \le x) \\
&= \sum_{k=1}^{\infty} P(S_N \le x \mid N = k)\, P(N = k) \\
&= \sum_{k=1}^{\infty} P(S_k \le x)\, P(N = k), \qquad (15.8)
\end{aligned}
$$

where (15.8) follows from the independence of N and the $X_i$'s.

In the above expression, we know how to compute $P(S_k \le x)$ from the previous section. Thus we have essentially
computed the distribution of the random sum of random variables under the specified independence
assumptions.


The following example is quite instructive. 
Example 15.4 Geometric Sum of Exponentials:

Let $X_i$, $i \ge 1$, be independent random variables with distribution exp($\mu$). Let N be a positive integer valued
random variable of geometric distribution with parameter p.

Define $S_N = \sum_{i=1}^{N} X_i$. We will now determine the distribution of $S_N$.

We know that $P(N = k) = (1 - p)^{k-1} p$, $\forall k \ge 1$. Further, we observed earlier (15.7) that the sum of k
exponential random variables of mean $\frac{1}{\mu}$, $S_k = \sum_{i=1}^{k} X_i$, has a $k^{th}$ order Erlang distribution. Thus, using this and
(15.8), we get

$$
\begin{aligned}
F_{S_N}(x) &= P(S_N \le x) \\
&= \sum_{k=1}^{\infty} P(N = k)\, F_{S_k}(x) \\
&= \sum_{k=1}^{\infty} p (1 - p)^{k-1} \left( 1 - \sum_{n=0}^{k-1} \frac{e^{-\mu x} (\mu x)^n}{n!} \right) \\
&= \sum_{k=1}^{\infty} p (1 - p)^{k-1} - \sum_{k=1}^{\infty} \sum_{n=0}^{k-1} p (1 - p)^{k-1} \frac{e^{-\mu x} (\mu x)^n}{n!} \\
&= 1 - e^{-\mu x} \sum_{n=0}^{\infty} \frac{\left( (1 - p) \mu x \right)^n}{n!} \\
&= 1 - e^{-p \mu x}, \quad x \ge 0.
\end{aligned}
$$

The above derivation establishes that the geometric sum of exponentials has an exponential distribution with
parameter $p\mu$.


Consider a radioactive source emitting $\alpha$ particles where the time between two successive emissions is
exponentially distributed with parameter $\lambda$. Whenever there is an emission, the detector detects it with
probability p and misses it with probability $1 - p$, independent of other detections. So it can be easily
seen that the time between two successive detections is indeed a geometric sum of i.i.d. exponential random
variables, which itself is an exponential random variable with parameter $p\lambda$, as seen in the above example.
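The geometric-sum result can be verified numerically by truncating the series over k and using the Erlang CDF for each term. In the sketch below, `erlang_cdf`, `geometric_sum_cdf` and the truncation point `k_max` are illustrative assumptions; the neglected tail of the series is bounded by $(1-p)^{k_{max}}$.

```python
from math import exp


def erlang_cdf(k, rate, x):
    """CDF at x of a sum of k i.i.d. exp(rate) variables:
    1 - sum_{n=0}^{k-1} exp(-rate*x) (rate*x)^n / n!."""
    if x <= 0.0:
        return 0.0
    term = exp(-rate * x)  # n = 0 term of the sum
    partial = 0.0
    for n in range(k):
        partial += term
        term *= rate * x / (n + 1)  # turn the n-th term into the (n+1)-th
    return 1.0 - partial


def geometric_sum_cdf(p, rate, x, k_max=500):
    """Truncated series: sum_{k=1}^{k_max} P(N = k) * F_{S_k}(x), with N geometric(p)."""
    total = 0.0
    weight = p  # P(N = 1)
    for k in range(1, k_max + 1):
        total += weight * erlang_cdf(k, rate, x)
        weight *= 1.0 - p  # P(N = k + 1) = (1 - p) * P(N = k)
    return total


if __name__ == "__main__":
    p, rate, x = 0.3, 2.0, 1.5  # assumed example values
    print(geometric_sum_cdf(p, rate, x), 1.0 - exp(-p * rate * x))
```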


The above study gives a detailed account of the random sum of random variables under the strict independence
constraints assumed earlier. It is however possible to envision a scenario where the random number
N is dependent on the observations $X_i$ themselves.

For instance, let us assume that a gambler plays a game repeatedly and is rewarded or penalized in each
round. Say the gambler stops only when he is "satisfied" (or "broke") with the overall outcome of the game.
Let $X_i$ be the amount he gains (or loses) in round i of the game. In this scenario, analysing the overall sum
earned by the gambler at the end of his game is complicated by the dependence of N on the outcomes. This
scenario motivates the theory of stopping rules, which shall be covered in a more advanced course (EE6150).


15.3 Exercise: 


1. Let $X_1$ and $X_2$ be independent random variables with distributions $\mathcal{N}(0, \sigma_1^2)$ and $\mathcal{N}(0, \sigma_2^2)$ respectively.
Show that the distribution of $X_1 + X_2$ is $\mathcal{N}(0, \sigma_1^2 + \sigma_2^2)$.


2. Consider two independent and identically distributed discrete random variables X and Y . Assume 
that their common PMF, denoted by p(z), is symmetric around zero, i.e., p(z) = p(—z), Vz. Show that 
the PMF of X +Y is also symmetric around zero and is largest at zero. 


3. Suppose X and Y are independent random variables with $Z = X + Y$ such that $f_X(x) = c e^{-cx}$, $x \ge 0$,
and $f_Z(z) = c^2 z e^{-cz}$, $z \ge 0$. Compute $f_Y(y)$.

4. Let $X_1$ and $X_2$ be the number of calls arriving at a switching centre from two different localities at
a given instant of time. $X_1$ and $X_2$ are well modelled as independent Poisson random variables with
parameters $\lambda_1$ and $\lambda_2$ respectively.


(a) Find the PMF of the total number of calls arriving at the switching centre. 
(b) Find the conditional PMF of X; given the total number of calls arriving at the switching centre 
is n. 
5. The random variables X, Y and Z are independent and uniformly distributed between zero and one. 
Find the PDF of X +Y +Z. 


6. Construct an example to show that the sum of a random number of independent normal random
variables is not normal.


Lecture 16: General Transformations of Random Variables


Lecturer: Dr. Krishna Jagannathan Scribe: Ajay and Jainam


In the previous lectures, we have seen a few elementary transformations such as sums of random variables
as well as maximum and minimum of random variables. Now we will look at general transformations of 
random variables. The motivation behind transformation of a random variable is illustrated by the following 
example. Consider a situation where the velocity of a particle is distributed according to a random variable 
V. Based on a particular realisation of the velocity, there will be a corresponding value of kinetic energy E 
and we are interested in the distribution of kinetic energy. Clearly, this is a scenario where we are asking 
for the distribution of a new random variable, which depends on the original random variable through a 
transformation. Such situations occur often in practical applications. 


Figure 16.1: Transformation of random variable


16.1 Transformations of a Single Random Variable 


Consider a random variable $X : \Omega \to \mathbb{R}$ and let $g : \mathbb{R} \to \mathbb{R}$ be a Borel measurable function. Then $Y = g(X)$
is also a random variable and we wish to find the distribution of Y. Specifically, we are interested in finding
the CDF $F_Y(y)$ given the CDF $F_X(x)$.

$$
F_Y(y) = P(g(X) \le y) = P(\{\omega \mid g(X(\omega)) \le y\}).
$$

Let $B_y$ be the set of all x such that $g(x) \le y$. Then $F_Y(y) = P_X(B_y)$.
We now illustrate this with the help of an example.


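Example 1: Let X be a Gaussian random variable of mean 0 and variance 1, i.e., $X \sim \mathcal{N}(0,1)$. Find the
distribution of $Y = X^2$.

Solution: For $y \ge 0$,

$$
F_Y(y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}) = \int_{-\sqrt{y}}^{\sqrt{y}} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx = \Phi(\sqrt{y}) - \Phi(-\sqrt{y}),
$$

where $\Phi$ is the CDF of $\mathcal{N}(0,1)$.
Now, differentiating with respect to y and writing $\phi = \Phi'$ for the standard Gaussian pdf,

$$
f_Y(y) = \frac{dF_Y(y)}{dy} = \frac{1}{2\sqrt{y}} \left[ \phi(\sqrt{y}) + \phi(-\sqrt{y}) \right].
$$

From above,

$$
f_Y(y) = \frac{1}{\sqrt{2\pi y}}\, e^{-y/2} \quad \text{for } y > 0.
$$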
Note: 


1. The random variable Y can take only non-negative values as it is square of a real valued random 
variable. 


2. The distribution of the square of a standard Gaussian random variable, with density $f_Y(y)$, is also known as the Chi-squared
distribution (with one degree of freedom).
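The first-principles computation for the square of a standard Gaussian can be checked numerically. The sketch below builds $\Phi$ from the standard library `math.erf`, evaluates the CDF $\Phi(\sqrt{y}) - \Phi(-\sqrt{y})$, and compares a finite-difference derivative of that CDF with the chi-squared (one degree of freedom) density $e^{-y/2}/\sqrt{2\pi y}$. The function names and the step size `h` are illustrative assumptions.

```python
from math import erf, exp, pi, sqrt


def std_normal_cdf(x):
    """CDF of N(0, 1), expressed through the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def cdf_of_square(y):
    """F_Y(y) = P(X^2 <= y) = Phi(sqrt(y)) - Phi(-sqrt(y)) for X ~ N(0, 1)."""
    if y <= 0.0:
        return 0.0
    root = sqrt(y)
    return std_normal_cdf(root) - std_normal_cdf(-root)


def pdf_of_square(y):
    """Density of Y = X^2: exp(-y/2) / sqrt(2*pi*y) for y > 0."""
    if y <= 0.0:
        return 0.0
    return exp(-y / 2.0) / sqrt(2.0 * pi * y)


def numerical_derivative(func, y, h=1e-5):
    """Central finite-difference approximation of func'(y)."""
    return (func(y + h) - func(y - h)) / (2.0 * h)


if __name__ == "__main__":
    y = 1.7  # assumed evaluation point
    print(numerical_derivative(cdf_of_square, y), pdf_of_square(y))
```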


Thus, we see that given the distribution of a random variable X, the distribution of any function of X can 
be obtained by first principles. We now come up with a direct formula to find the distribution of a function 
of the random variable in the cases where the function is differentiable and monotonic. 


Let X have a density $f_X(x)$, let g be a differentiable, monotonically increasing function, and let $Y = g(X)$. We then have

$$
F_Y(y) = P(Y \le y) = P(g(X) \le y) = \int_{-\infty}^{g^{-1}(y)} f_X(x)\, dx.
$$

Note that as g is a monotonically increasing function, $g(x) \le y \Rightarrow x \le g^{-1}(y)$.

Let $x = g^{-1}(t)$, so $g'(x)\, dx = dt$. Then

$$
F_Y(y) = \int_{g(-\infty)}^{y} f_X(g^{-1}(t))\, \frac{dt}{g'(g^{-1}(t))},
$$

where $g(-\infty)$ denotes $\lim_{x \to -\infty} g(x)$.

Differentiating, we get

$$
f_Y(y) = f_X(g^{-1}(y)) \cdot \frac{1}{g'(g^{-1}(y))}.
$$

The second term on the right-hand side of the above equation is referred to as the Jacobian of the
transformation $g(\cdot)$.

It can be shown easily that a similar argument holds for a monotonically decreasing function g as well, and
we obtain

$$
f_Y(y) = f_X(g^{-1}(y)) \cdot \left( \frac{-1}{g'(g^{-1}(y))} \right).
$$

Hence, the general formula for the density of monotonic functions of random variables is

$$
f_Y(y) = \frac{f_X(g^{-1}(y))}{\left| g'(g^{-1}(y)) \right|}. \tag{16.1}
$$

Example 2: Let $X \sim \mathcal{N}(0,1)$. Find the distribution of $Y = e^X$.


Solution: Note that the function $g(x) = e^x$ is a differentiable, monotonically increasing function.

As $g(x) = e^x$, we have $g^{-1}(y) = \ln(y)$ and $g'(g^{-1}(y)) = y$. Here we see that the Jacobian will be positive
for all values of y and hence $|g'(g^{-1}(y))| = g'(g^{-1}(y)) = y$.
Finally, we have

$$
f_Y(y) = \frac{f_X(\ln(y))}{y} = \frac{1}{y\sqrt{2\pi}}\, e^{-\frac{(\ln(y))^2}{2}} \quad \text{for } y > 0.
$$

This is the log-normal pdf.


Example 3: Let $U \sim$ unif[0,1], i.e., U is a uniform random variable in the interval [0,1]. Find the
distribution of $Y = -\ln(U)$.

Solution: Note that $g(u) = -\ln(u)$ is a differentiable, monotonically decreasing function.

As $g(u) = -\ln(u)$, we have $g^{-1}(y) = e^{-y}$ and $g'(g^{-1}(y)) = -\frac{1}{e^{-y}}$. Here we see that the Jacobian will be
negative for all values of y and hence $|g'(g^{-1}(y))| = -g'(g^{-1}(y)) = \frac{1}{e^{-y}}$.

Hence we have

$$
f_Y(y) = \frac{f_U(e^{-y})}{1 / e^{-y}} = \frac{1}{1 / e^{-y}} = e^{-y} \quad \text{for } y \ge 0.
$$

Note that Y is an exponential random variable with mean 1.
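Formula (16.1) can be wrapped in a small generic helper and applied to both of the preceding examples. This is a sketch under stated assumptions: `transformed_pdf` is an illustrative name, and the caller is assumed to supply the inverse $g^{-1}$ and the derivative $g'$ of a differentiable, strictly monotone g, and to evaluate only at points y in the range of g.

```python
from math import exp, log, pi, sqrt


def std_normal_pdf(x):
    """pdf of N(0, 1)."""
    return exp(-x * x / 2.0) / sqrt(2.0 * pi)


def unif01_pdf(u):
    """pdf of Unif[0, 1]."""
    return 1.0 if 0.0 <= u <= 1.0 else 0.0


def transformed_pdf(f_x, g_inverse, g_prime, y):
    """Evaluate f_Y(y) = f_X(g^{-1}(y)) / |g'(g^{-1}(y))| for a monotone differentiable g."""
    x = g_inverse(y)
    return f_x(x) / abs(g_prime(x))


def lognormal_pdf(y):
    """Y = exp(X) with X ~ N(0, 1): g^{-1} = log and g' = exp. Valid for y > 0."""
    return transformed_pdf(std_normal_pdf, log, exp, y)


def neg_log_uniform_pdf(y):
    """Y = -ln(U) with U ~ Unif[0, 1]: g^{-1}(y) = exp(-y) and g'(u) = -1/u. Valid for y >= 0."""
    return transformed_pdf(unif01_pdf, lambda t: exp(-t), lambda u: -1.0 / u, y)


if __name__ == "__main__":
    print(lognormal_pdf(2.0))        # compare with exp(-(ln 2)^2 / 2) / (2 sqrt(2 pi))
    print(neg_log_uniform_pdf(0.8))  # compare with exp(-0.8)
```

Taking the absolute value of the derivative lets one helper cover both the increasing and the decreasing case.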


If X is a continuous random variable with CDF $F_X(\cdot)$, then it can be shown that the random variable
$Y = F_X(X)$ is uniformly distributed over [0,1] (see Exercise 2(a)). It can be seen from this result that any
continuous random variable Y with an invertible CDF can be generated from a uniform random variable $X \sim$ unif[0,1] by the
transformation $Z = F_Y^{-1}(X)$, where $F_Y(\cdot)$ is the CDF of the random variable Y; the resulting Z has the same distribution as Y.
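As a concrete instance of generating a random variable through the inverse CDF, the sketch below samples an exponential random variable from uniform draws. The target distribution, the function names, the seed and the sample size are assumptions chosen for illustration; for rate $\lambda$, the CDF $F(y) = 1 - e^{-\lambda y}$ has inverse $F^{-1}(u) = -\ln(1 - u)/\lambda$.

```python
import random
from math import log


def exponential_inverse_cdf(u, rate):
    """Inverse of F(y) = 1 - exp(-rate * y), defined for u in [0, 1)."""
    return -log(1.0 - u) / rate


def sample_exponential(rate, n, seed=0):
    """Generate n exponential(rate) samples by pushing uniforms through the inverse CDF."""
    rng = random.Random(seed)
    # rng.random() returns values in [0, 1), so 1 - u stays in (0, 1] and log() is finite.
    return [exponential_inverse_cdf(rng.random(), rate) for _ in range(n)]


if __name__ == "__main__":
    draws = sample_exponential(rate=2.0, n=100_000)
    print(sum(draws) / len(draws))  # should be close to the mean 1 / rate
```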


16.2 Transformation of Multiple Random Variables 


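Equation 16.1 can be extended to transformations of multiple random variables. Consider an n-tuple
random variable $(X_1, X_2, \dots, X_n)$ whose joint density is given by $f_{X_1, X_2, \dots, X_n}(x_1, x_2, \dots, x_n)$ and the
corresponding transformations are given by $Y_1 = g_1(X_1, X_2, \dots, X_n)$, $Y_2 = g_2(X_1, X_2, \dots, X_n)$, $\dots$, $Y_n =
g_n(X_1, X_2, \dots, X_n)$. Succinctly, we denote this as a vector transformation $Y = g(X)$, where $g : \mathbb{R}^n \to \mathbb{R}^n$.
We assume that the transformation g is invertible and continuously differentiable. Under this assumption,
the joint density $f_{Y_1, Y_2, \dots, Y_n}(y_1, y_2, \dots, y_n)$ is given by (see Section 2.2 in Lecture 10 of [1])

$$
f_{Y_1, Y_2, \dots, Y_n}(y_1, y_2, \dots, y_n) = f_{X_1, X_2, \dots, X_n}(g^{-1}(y))\, |J(y)|, \tag{16.2}
$$

where $|J(y)|$ is the absolute value of the determinant of the Jacobian matrix $J(y)$, given by

$$
J(y) = \begin{bmatrix}
\dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_2}{\partial y_1} & \cdots & \dfrac{\partial x_n}{\partial y_1} \\
\dfrac{\partial x_1}{\partial y_2} & \dfrac{\partial x_2}{\partial y_2} & \cdots & \dfrac{\partial x_n}{\partial y_2} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial x_1}{\partial y_n} & \dfrac{\partial x_2}{\partial y_n} & \cdots & \dfrac{\partial x_n}{\partial y_n}
\end{bmatrix}.
$$

Figure 16.2: Mapping of a realization $(X(\omega), Y(\omega))$ to the polar co-ordinates $(R(\omega), \Theta(\omega))$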
We now illustrate this with the help of an example. 


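Example 4: Let the Euclidean co-ordinates of a particle be drawn from independent, identically distributed
Gaussian random variables of mean 0 and variance 1, i.e., $X, Y \sim \mathcal{N}(0,1)$. Find the distribution of the
particle's polar co-ordinates, R and $\Theta$.

Solution: The corresponding transformations are given by $X = R\cos\Theta$ and $Y = R\sin\Theta$.
Let us first evaluate the Jacobian. We have $x = r\cos\theta$ and $y = r\sin\theta$. So we have $\frac{\partial x}{\partial r} = \cos\theta$, $\frac{\partial y}{\partial r} = \sin\theta$,
$\frac{\partial x}{\partial \theta} = -r\sin\theta$ and $\frac{\partial y}{\partial \theta} = r\cos\theta$. Hence,

$$
J = \begin{vmatrix} \dfrac{\partial x}{\partial r} & \dfrac{\partial y}{\partial r} \\ \dfrac{\partial x}{\partial \theta} & \dfrac{\partial y}{\partial \theta} \end{vmatrix}
= \begin{vmatrix} \cos\theta & \sin\theta \\ -r\sin\theta & r\cos\theta \end{vmatrix}
= r\cos^2\theta + r\sin^2\theta = r(\cos^2\theta + \sin^2\theta) = r.
$$

Next, we have $X, Y \sim \mathcal{N}(0,1)$ and they are independent, so

$$
f_{X,Y}(x,y) = f_X(x)\, f_Y(y) = \frac{1}{2\pi}\, e^{-\frac{x^2 + y^2}{2}}.
$$

From (16.2), we have

$$
f_{R,\Theta}(r, \theta) = \frac{1}{2\pi}\, e^{-\frac{(r\cos\theta)^2 + (r\sin\theta)^2}{2}} \times r = \frac{1}{2\pi}\, r\, e^{-r^2/2}, \quad \text{where } r \ge 0 \text{ and } \theta \in [0, 2\pi].
$$

The marginal densities of R and $\Theta$ can be obtained from the joint distribution as given below.

$$
f_R(r) = \int_0^{2\pi} f_{R,\Theta}(r, \theta)\, d\theta = r\, e^{-r^2/2} \quad \text{for } r \ge 0.
$$

$$
f_\Theta(\theta) = \int_0^{\infty} f_{R,\Theta}(r, \theta)\, dr = \frac{1}{2\pi} \quad \text{for } \theta \in [0, 2\pi].
$$

The distribution $f_R(r)$ is called the Rayleigh distribution, which is encountered quite often in wireless
communications to model the gain of a fading channel. Note that the random variables R and $\Theta$ are independent
since the joint distribution factorizes into the product of the marginals, i.e.,

$$
f_{R,\Theta}(r, \theta) = f_R(r) \times f_\Theta(\theta).
$$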


We now illustrate how transformations of random variables help us to generate random variables with
different distributions given that we can generate only uniform random variables. Specifically, consider the
case where all we can generate is a uniform random variable between 0 and 1, i.e., unif[0,1], and we wish
to generate random variables having Rayleigh, exponential and Gaussian distributions.

Generate $U_1$ and $U_2$ as i.i.d. unif[0,1]. Next, let $\Theta = 2\pi U_1$ and $Z = -2\ln(U_2)$. It can be verified that
$\Theta \sim$ Unif$[0, 2\pi]$ and $Z \sim$ Exp(0.5).

Thereafter, let $R = \sqrt{Z}$. It can be shown that R is a Rayleigh distributed random variable (see Exercise 1).
Lastly, let $X = R\cos\Theta$ and $Y = R\sin\Theta$. It is easy to see from Example 4 that X and Y will be i.i.d.
$\mathcal{N}(0,1)$.

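The recipe above (commonly known as the Box-Muller transform) can be sketched in Python as follows. The function names, the seed and the sample size are illustrative assumptions; the only source of randomness is the standard library uniform generator.

```python
import random
from math import cos, log, pi, sin, sqrt


def box_muller(u1, u2):
    """Map two uniforms (u1 in [0, 1], u2 in (0, 1]) to a pair (x, y) following the recipe above."""
    theta = 2.0 * pi * u1   # Theta ~ Unif[0, 2*pi]
    z = -2.0 * log(u2)      # Z ~ Exp(0.5)
    r = sqrt(z)             # R = sqrt(Z) is Rayleigh distributed
    return r * cos(theta), r * sin(theta)


def sample_gaussian_pairs(n, seed=0):
    """Generate n pairs of (approximately) i.i.d. N(0, 1) values from uniform draws."""
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], which keeps log() finite.
    return [box_muller(rng.random(), 1.0 - rng.random()) for _ in range(n)]


if __name__ == "__main__":
    pairs = sample_gaussian_pairs(50_000)
    xs = [p[0] for p in pairs]
    mean_x = sum(xs) / len(xs)
    var_x = sum(v * v for v in xs) / len(xs) - mean_x**2
    print(mean_x, var_x)  # should be close to 0 and 1 respectively
```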
16.3 Exercises 


1. Let $X \sim$ exp(0.5). Prove that $Y = \sqrt{X}$ is a Rayleigh distributed random variable.


2. (a) Let X be a random variable with a continuous distribution F.
(i) Show that the random variable $Y = F(X)$ is uniformly distributed over [0,1]. [Hint: Although
F is the distribution of X, regard it simply as a function satisfying certain properties
required to make it a CDF!]

(ii) Now, given that $Y = y$, a random variable Z is distributed as Geometric with parameter
y. Find the unconditional PMF of Z. Also, given $Z = z$ for some $z \ge 1$, $z \in \mathbb{N}$, find the
conditional PMF of Y.


(b) Let X be a continuous random variable with the pdf 


f(a) = { en 


0 «<0. 


Find the transformation $Y = g(X)$ such that the pdf of Y will be

$$
f_Y(y) = \begin{cases} \dfrac{1}{2\sqrt{y}}, & 0 < y < 1, \\ 0, & \text{otherwise.} \end{cases}
$$

[Hint: Question 2(a) might be of use here!]


3. Suppose X and Y are independent Gaussian random variables with zero mean and variance $\sigma^2$. Show
that $\frac{X}{Y}$ is Cauchy.


4. (a) Particles are subject to collisions that cause them to split into two parts, with each part a fraction of the parent. Suppose that this fraction is uniformly distributed between 0 and 1. Following a single particle through several splittings, we obtain a fraction of the original particle $Z_n = X_1 X_2 \cdots X_n$, where each $X_i$ is uniformly distributed between 0 and 1. Show that the density for the random variable $Z_n$ is

$$f_n(z) = \frac{1}{(n-1)!}\,(-\ln z)^{n-1}, \quad 0 < z < 1.$$

(b) Suppose X and Y are independent exponential random variables with same parameter A. Derive 
the pdf of the random variable Z = i. 


5. A random variable $Y$ has the pdf $f_Y(y) = K y^{-(b+1)}$, $y > 2$ (and zero otherwise), where $b > 0$. This random variable is obtained as the monotonically increasing transformation $Y = g(X)$ of the random variable $X$ with pdf $e^{-x}$, $x > 0$.

(a) Determine $K$ in terms of $b$.
(b) Determine the transformation $g(\cdot)$ in terms of $b$.


6. (a) Two particles start from the same point on a two-dimensional plane, and move with speed $V$ each, such that the angle between them is uniformly distributed in $[0, 2\pi]$. Find the distribution of the magnitude of the relative velocity between the two particles.


(b) A point is picked uniformly from inside a unit circle. What is the density of R, the distance of 
the point from the center? 


7. Let $X$ and $Y$ be independent exponentially distributed random variables with parameter 1. Find the joint density of $U = X + Y$ and $V = \frac{X}{X+Y}$, and show that $V$ is uniformly distributed.


References 


[1] DAVID GAMARNICK AND JOHN TSITSIKLIS, “Introduction to Probability”, MIT OCW, 2008.


EE5110: Probability Foundations for Electrical Engineers July-November 2015 


Lecture 17: Integration and Expectation 
Lecturer: Dr. Krishna Jagannathan Scribe: Gopal Krishna Kamath M 


In this chapter, we introduce abstract integration, and in particular, define the integral of a measurable 
function, with respect to a measure. As a special case, the integral of a random variable with respect to a 
probability measure is known as the expectation of the random variable. 


Our approach to defining the expectation of a random variable as an abstract integral serves to unify the 
definition. After all, you may recall from your undergraduate study of probability that the expectation of 
a random variable is defined via an integral if the random variable is continuous, and a summation if it is 
discrete. Of course, if the random variable were singular or a mixture, the elementary approach does not 
provide a simple definition of the expectation. On the other hand, the definition we are about to give is 
completely general; specifically, we do not have to provide separate definitions for different types of random 
variables. 


In addition, the theory of abstract integration allows us to define the Lebesgue integral, which generalizes 
the notion of the Riemann integral from high-school calculus. 


17.1 The Riemann Integral: A Review 


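Consider a function $f : \mathbb{R} \to \mathbb{R}$. Let $[a,b]$ be an interval in the domain of $f$, and let $\sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_n\}$ be a partition of $[a,b]$ into sub-intervals $\sigma_i = [x_i, x_{i+1}]$ of length $\Delta x_i = x_{i+1} - x_i$. The lower and upper Riemann sums, denoted by $L_n$ and $U_n$ respectively, are defined as below:

$$L_n \triangleq \sum_{i=1}^{n} \Big( \inf_{x \in [x_i, x_{i+1}]} f(x) \Big) \Delta x_i,$$

$$U_n \triangleq \sum_{i=1}^{n} \Big( \sup_{x \in [x_i, x_{i+1}]} f(x) \Big) \Delta x_i.$$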


As $n$ increases in a manner such that each $\Delta x_i$ decreases to zero, it can be seen that $L_n$ is monotone increasing, while $U_n$ is monotone decreasing. So, as $n \to \infty$, it follows that $L_n$ and $U_n$ will both converge. The limits of $L_n$ and $U_n$ are called the lower and upper Riemann integrals, respectively. That is,

$$\underline{\int_a^b} f(x)\,dx \triangleq \lim_{n \to \infty} L_n,$$

$$\overline{\int_a^b} f(x)\,dx \triangleq \lim_{n \to \infty} U_n.$$


It can be shown that the values of the lower and upper Riemann integrals do not depend on the choice of the partition. It is also clear that the following relation always holds:

$$\underline{\int_a^b} f(x)\,dx \le \overline{\int_a^b} f(x)\,dx.$$

Figure 17.1: An arbitrary function f over the interval [a, b].

Figure 17.2: Lower Riemann approximation of f.

Figure 17.3: Upper Riemann approximation of f.


Definition 17.1 A function $f$ is said to be Riemann integrable if the values of the lower Riemann integral and the upper Riemann integral coincide. In such a case, the Riemann integral of $f$ is that common value.

That is, the Riemann integral of $f$ (when it exists), denoted by $\int_a^b f(x)\,dx$, is given by

$$\int_a^b f(x)\,dx = \underline{\int_a^b} f(x)\,dx = \overline{\int_a^b} f(x)\,dx.$$
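The lower and upper Riemann sums can be sketched numerically. The snippet below is only an illustration and makes a simplifying assumption that is not in the notes: $f$ is non-decreasing on $[a,b]$, so that the infimum and supremum over each sub-interval are attained at its endpoints. The helper name `riemann_sums_increasing` and the equal-width partition are likewise illustrative choices.

```python
def riemann_sums_increasing(f, a, b, n):
    """Lower and upper Riemann sums of a non-decreasing f on [a, b],
    using n sub-intervals of equal width.

    For a non-decreasing f, the infimum over [x_i, x_{i+1}] is f(x_i)
    and the supremum is f(x_{i+1}), so both sums can be evaluated directly.
    """
    dx = (b - a) / n
    points = [a + i * dx for i in range(n + 1)]
    lower = sum(f(points[i]) * dx for i in range(n))
    upper = sum(f(points[i + 1]) * dx for i in range(n))
    return lower, upper


if __name__ == "__main__":
    # f(x) = x^2 on [0, 1]; both sums should approach 1/3 as n grows.
    for n in (10, 100, 1000):
        lo, up = riemann_sums_increasing(lambda x: x * x, 0.0, 1.0, n)
        print(n, lo, up)
```

For this choice of $f$, the gap between the two sums is $(f(b) - f(a))(b-a)/n$, which shrinks to zero as the partition is refined, in line with Definition 17.1.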


Figure (17.1) shows the graph of an arbitrary function $f$ over the interval $[a,b]$. Figures (17.2) and (17.3) show the lower and upper Riemann approximations of $f$ respectively, wherein $f$ is graphed in red for reference. The lower (resp. upper) Riemann sum is the area under the lower (resp. upper) Riemann approximation in figure (17.2) (resp. (17.3)). We can imagine the lower (resp. upper) Riemann sum to be approximating the area under $f$ from below (resp. above). The intuition is that as the partitions become “finer”, the lower and upper Riemann sums converge to the area under $f$ from below and above respectively. Notice that figures (17.2) and (17.3) depict partitions with unequal sub-interval sizes.

Next, we turn our attention to abstract integrals.


17.2 Abstract Integration 


Let $(\Omega, \mathcal{F}, \mu)$ be a measure space, and $f : \Omega \to \mathbb{R}$ be an $\mathcal{F}$-measurable function. For any $A \in \mathcal{F}$, we would like to define

$$\int_A f\,d\mu.$$

We will call the above quantity the integral of $f$ with respect to the measure $\mu$ over the $\mathcal{F}$-measurable set $A$. Also, in the interest of notational simplicity, we will use the following two notations interchangeably to mean the integral of the $\mathcal{F}$-measurable function $f$ with respect to the measure $\mu$ over the entire space:

$$\int f\,d\mu = \int_\Omega f\,d\mu.$$


Before we define the abstract integral, let us look at two very important special cases. 


17.2.1 Special Cases 


1. The Lebesgue integral: Let $(\Omega, \mathcal{F}, \mu) = (\mathbb{R}, \mathcal{B}(\mathbb{R}), \lambda)$.
Let $f : \mathbb{R} \to \mathbb{R}$ be a Borel-measurable function. In this case, the integral

$$\int f\,d\lambda$$

is called the Lebesgue integral of $f$ over the reals.

The Lebesgue integral can be shown to be a generalisation of the Riemann integral. In particular, it 
allows us to integrate over arbitrary Borel sets, instead of just intervals. Moreover, we will see that 
the Lebesgue integral might exist even when the Riemann integral does not. However, if a function is 
Riemann integrable over an interval, then it is necessarily Lebesgue integrable, and the values of the 
two integrals will be equal. 


2. Expectation of a random variable: Let $(\Omega, \mathcal{F}, \mu) = (\Omega, \mathcal{F}, P)$.
Let $X : \Omega \to \mathbb{R}$ be a random variable. In this case, the integral

$$\int X\,dP$$

is called the expectation of the random variable $X$, and is denoted by $E[X]$. Therefore,

$$E[X] \triangleq \int X\,dP.$$


Note that, so far, we have not defined what an abstract integral is. We have only introduced the notations and terminologies used. We now lay out the roadmap to defining the abstract integral.


17.2.2 Roadmap for defining the abstract integral


The abstract integral of an arbitrary $\mathcal{F}$-measurable function $f$ is defined in four steps as outlined below:


1. First, we define the integral for simple functions, i.e., non-negative functions that take only finitely many values.


2. Second, we define the integral for non-negative functions. This is done by approximating the function by simple functions, thus allowing us to define the integral of the non-negative function in terms of the integrals of the simple functions.


3. Third, we write the arbitrary function $f$ as $f = f_+ - f_-$, where $f_+$ and $f_-$ are non-negative functions which correspond to the positive and negative components of $f$. Then, we define the integral of $f$ as

$$\int f\,d\mu = \int f_+\,d\mu - \int f_-\,d\mu.$$


4. And last, we define

$$\int_A f\,d\mu = \int f\, I_A\,d\mu.$$


17.2.3 Abstract Integrals of Simple Functions


Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and $f : \Omega \to \mathbb{R}$ be an $\mathcal{F}$-measurable function.


Definition 17.2 A function $f$ is said to be a simple function if it can be written as

$$f(\omega) = \sum_{i=1}^{n} a_i I_{A_i}(\omega), \quad \forall \omega \in \Omega, \tag{17.1}$$

where $a_i \ge 0 \ \forall i \in \{1,2,\ldots,n\}$, and $A_i \in \mathcal{F} \ \forall i \in \{1,2,\ldots,n\}$.


Remark 17.3 Note that $f(\omega)$ written in this form is not unique. This problem is circumvented using the “canonical” representation, wherein we restrict the $a_i$'s to be distinct and the $A_i$'s to be disjoint. It can be verified that this restriction enforces uniqueness of representation. Henceforth, whenever the term “simple function” is used, it will be taken implicitly to be in the canonical form.


Figure (17.4) shows the canonical representation of a simple random variable $X$ taking 4 values such that $\omega \in A_i \implies X(\omega) = a_i \ \forall i \in \{1,2,3,4\}$. As the figure shows, disjoint events are mapped to distinct, non-negative real numbers.


Definition 17.4 Let $(\Omega, \mathcal{F}, \mu)$ be a measure space and let $f \ge 0$ be a simple function with the canonical representation (17.1). The abstract integral of $f$ with respect to the measure $\mu$ is defined as

$$\int f\,d\mu \triangleq \sum_{i=1}^{n} a_i\, \mu(A_i).$$

Figure 17.4: Canonical representation of a simple random variable taking 4 values.


Example 1:- Consider the measure space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \lambda)$, and define $f(\omega) = u(\omega) + u(\omega - 1) - 2u(\omega - 3)$, where $u(\cdot)$ is the Heaviside step function. The canonical representation of this simple function is $f(\omega) = I_{[0,1)}(\omega) + 2 I_{[1,3)}(\omega)$. Therefore, the Lebesgue integral of this function is

$$\int f\,d\lambda = 1 \times \lambda([0,1)) + 2 \times \lambda([1,3)) = 1 \times 1 + 2 \times 2 = 5.$$



We note that the value of the integral equals the area under the curve. 
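The definition of the integral of a simple function is just a finite weighted sum, which the following sketch makes concrete for Example 1. The helper names `integrate_simple` and `interval_length` are hypothetical, and the representation of a simple function as a list of (value, measure) pairs is an assumption made purely for illustration.

```python
from fractions import Fraction


def integrate_simple(terms):
    """Integral of a simple function given in canonical form.

    `terms` is a list of (a_i, mu_A_i) pairs: the value a_i >= 0 and the
    measure mu(A_i) of the set on which that value is taken.
    """
    return sum(a * mu for a, mu in terms)


def interval_length(left, right):
    """Lebesgue measure of the interval [left, right)."""
    return Fraction(right) - Fraction(left)


# Example 1: f = 1 * I_[0,1) + 2 * I_[1,3)
example_1 = [
    (1, interval_length(0, 1)),
    (2, interval_length(1, 3)),
]
```

Calling `integrate_simple(example_1)` reproduces the value 5 computed above.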

Example 2:- Consider the probability space $(\Omega = \{H,T\}^n, \mathcal{F}, P)$, which can be considered a model for $n$ independent coin tosses, each landing heads with probability $P(\{H\}) = p$. Let $X : \Omega \to \mathbb{R}$ be the simple random variable such that $X(\omega)$ represents the number of heads. Then the expected value of $X$ (i.e., the integral of $X$ with respect to $P$) can be calculated as

$$E[X] = \sum_{i=0}^{n} i\, P(X = i) = \sum_{i=0}^{n} i \binom{n}{i} p^i (1-p)^{n-i} = np.$$


Example 3:- Consider the Dirichlet function ($D$), defined to take on the value 1 on the rationals in the interval [0,1] and the value 0 on the irrationals in the interval [0,1]. For this function,

$$\underline{\int_0^1} D(x)\,dx = 0, \qquad \overline{\int_0^1} D(x)\,dx = 1.$$

This is because every sub-interval of a partition of the horizontal axis, no matter how fine, contains both rational and irrational points. Hence, the Dirichlet function is not Riemann integrable. On the other hand, the Dirichlet function is a simple function with the canonical representation

$$D(\omega) = I_{\mathbb{Q} \cap [0,1]}(\omega).$$

Therefore, the Lebesgue integral of the Dirichlet function is given by

$$\int D\,d\lambda = 1 \times \lambda(\mathbb{Q} \cap [0,1]) = 0,$$

since the rationals form a countable set and therefore have Lebesgue measure zero. Therefore, we see that the Dirichlet function is trivially Lebesgue integrable while not being Riemann integrable.
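Returning to Example 2, the expectation of the number of heads is a finite sum over the values of a simple random variable, so it can be evaluated directly. The sketch below (with a hypothetical helper name `expected_heads`) computes the sum $\sum_i i\,P(X = i)$ using the binomial PMF, and can be compared against the standard closed form $np$.

```python
from math import comb


def expected_heads(n, p):
    """E[X] = sum_i i * P(X = i) for the number of heads in n tosses."""
    return sum(i * comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1))
```

For instance, `expected_heads(10, 0.3)` should agree with $10 \times 0.3$ up to floating-point rounding.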


17.2.4 Abstract Integral of non-negative functions 


Let $(\Omega, \mathcal{F}, \mu)$ be a measure space, and $f : \Omega \to \mathbb{R}_+$ be a non-negative, $\mathcal{F}$-measurable function. Denote by $S(f)$ the collection of all simple functions $q : \Omega \to \mathbb{R}_+$ such that $q(\omega) \le f(\omega) \ \forall \omega \in \Omega$. That is, given a non-negative function $f$, we collect all the simple functions $q$ that approximate $f$ from below. Having done this, we now define the abstract integral of $f$ as follows:


Definition 17.5 The abstract integral of $f$ with respect to the measure $\mu$ is defined as

$$\int f\,d\mu \triangleq \sup_{q \in S(f)} \int q\,d\mu. \tag{17.2}$$


Since q’s are simple functions, calculation of their integral is known. The above equation gives a way to find 
the integral of any non-negative function. While being mathematically well-defined, (17.2) does not directly 
yield a practical method to compute the integral. We will address this issue later. 


17.2.5 Abstract Integral of arbitrary functions 


Let $(\Omega, \mathcal{F}, \mu)$ be a measure space, and $f : \Omega \to \mathbb{R}$ be an arbitrary $\mathcal{F}$-measurable function. Then, in order to evaluate $\int f\,d\mu$, we first write $f$ as $f = f_+ - f_-$, where $f_+ = \max(f, 0) \ge 0$ and $f_- = -\min(f, 0) \ge 0$. We then define the integral of $f$ with respect to $\mu$ as

$$\int f\,d\mu \triangleq \int f_+\,d\mu - \int f_-\,d\mu, \tag{17.3}$$

wherein the integrals of $f_+$ and $f_-$ are calculated as in the previous section. Since $f_+$ and $f_-$ are non-negative functions, both integrals on the right hand side of (17.3) are well defined. The above definition is meaningful as long as at least one of the integrals on the right hand side of (17.3) is finite. The integral of $f$ is left undefined if the integrals of $f_+$ and $f_-$ are both infinite.
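The decomposition into positive and negative parts, including the undefined case, can be sketched on a finite set carrying point masses. This is a toy setting chosen for illustration (the notes work with a general measure space), and the function name `integrate_signed` is hypothetical. Points where $f = 0$ are skipped, in line with the usual convention $0 \times \infty = 0$.

```python
import math


def integrate_signed(values, weights):
    """Integral of f over a finite set with point masses.

    values[i] is f at the i-th point and weights[i] >= 0 is the measure of
    that point (math.inf is allowed). Returns None when both the positive
    and the negative part have infinite integral, i.e. the undefined case.
    """
    # Integral of f_+ = max(f, 0) and of f_- = -min(f, 0), term by term.
    pos = sum(max(v, 0.0) * w for v, w in zip(values, weights) if v > 0)
    neg = sum(-min(v, 0.0) * w for v, w in zip(values, weights) if v < 0)
    if math.isinf(pos) and math.isinf(neg):
        return None
    return pos - neg
```

When exactly one of the two parts is infinite, the function returns plus or minus infinity, matching the remark that the definition remains meaningful in that case.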


17.2.6 Abstract Integral of arbitrary functions over a given set 


Let $(\Omega, \mathcal{F}, \mu)$ be a measure space, $f : \Omega \to \mathbb{R}$ be an arbitrary $\mathcal{F}$-measurable function, and $A \in \mathcal{F}$. Define $g \triangleq f I_A$. That is, we consider the function $f$ restricted to the set $A$. This is an $\mathcal{F}$-measurable function since it is a product of two $\mathcal{F}$-measurable functions. Its integral can be calculated as mentioned in the previous section. Therefore,

$$\int_A f\,d\mu \triangleq \int f I_A\,d\mu = \int g\,d\mu = \int g_+\,d\mu - \int g_-\,d\mu.$$


17.3 Exercises 


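1. Consider the measure space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \lambda)$, and define $f : \mathbb{R} \to \mathbb{R}$. Find the Lebesgue integral of the function $f$ for the following cases:

(a) $$f(\omega) = \begin{cases} \omega, & \text{for } \omega = 0, 1, \ldots, n, \\ 0, & \text{elsewhere.} \end{cases}$$

(b) $$f(\omega) = \begin{cases} 1, & \text{for } \omega \in \mathbb{Q}^c \cap [0, 1], \\ 0, & \text{elsewhere.} \end{cases}$$

(c) $$f(\omega) = \begin{cases} n, & \text{for } \omega \in \mathbb{Q}^c \cap [0, n], \\ 0, & \text{elsewhere.} \end{cases}$$


2. Consider the probability space $(\Omega, \mathcal{F}, P)$, and define the random variable $X : \Omega \to \mathbb{R}$. Find $E[X]$ for the following cases:

(a) $\Omega = \{\omega_1, \omega_2, \ldots, \omega_n\}$, with $P(\omega_i) = 1/n$ for $i = 1, 2, \ldots, n$ and $X = I_A$, where $A = \{\omega_1, \omega_2, \ldots, \omega_m\}$ with $1 \le m \le n$.

(b) In part (a), if $X$ is defined as follows:

$$X(\omega_i) = \begin{cases} i, & \text{for } \omega_i \in A, \\ 0, & \text{elsewhere.} \end{cases}$$


Lecture 18: Properties of Abstract Integrals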


Lecturer: Dr. Krishna Jagannathan Scribe: Ravi Kumar Kolla 


In this lecture, we discuss some basic properties of abstract integrals. 


18.1 Properties of Abstract Integrals 


We will state the properties for a generic abstract integral, and also particularize for the special case of the 
expectation of a random variable. 


Let $(\Omega, \mathcal{F})$ be a measurable space and $f, g, h$ be measurable functions from $\Omega$ to $\mathbb{R}$. Let $\mu$ be a generic measure and $P$ be a probability measure on $(\Omega, \mathcal{F})$. Let $X, Y$ be random variables defined on $(\Omega, \mathcal{F}, P)$. The first part of each property corresponds to the generic measure ($\mu$), while the second part particularizes to the probability measure ($P$).


In order to prove the properties, we follow a procedure: we begin by proving them for simple functions, and then extend them to the case of non-negative functions. Finally, using these, we prove the properties for general measurable functions.


[PAI 1] $\int I_A\,d\mu = \mu(A)$, for any $A \in \mathcal{F}$. In particular, for any $A \in \mathcal{F}$, we have $E[I_A] = P(A)$.


Proof: $f(\omega) = I_A(\omega)$ is a simple function and is in canonical form. So, the proof follows directly from the definition of the integral of simple functions (Definition 17.4 from Lecture #17).


[PAI 2] If $g \ge 0$, then $\int g\,d\mu \ge 0$. If $X \ge 0$, then $E(X) \ge 0$.


Proof: Let $g$ be a simple function and $g \ge 0$. A simple function is of the form of equation (17.1) from Lecture #17 with all $a_i$'s non-negative. So, $\int g\,d\mu \ge 0$.


Now, we prove the property for non-negative functions. Let $g$ be a non-negative function. Let $S(g)$ denote the collection of all simple functions $q$ such that $q(\omega) \le g(\omega)$, $\forall \omega \in \Omega$.
So, $\int q\,d\mu \ge 0$, $\forall q \in S(g)$, since $q$ is a simple function.
Hence, $\int g\,d\mu = \sup_{q \in S(g)} \int q\,d\mu \ge 0$, since the supremum of a set of non-negative numbers is non-negative. ∎


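[PAI 3] If $g = 0$ $\mu$-a.e., then $\int g\,d\mu = 0$. If $X = 0$ a.s., then $E(X) = 0$.
Proof: Let $g = 0$ $\mu$-a.e. be a simple function. Then, $g$ has a canonical representation of the form $g = \sum_{i=1}^{k} a_i I_{A_i}$, where $\mu(A_i) = 0$ for each $i$ with $a_i > 0$. Hence, $\int g\,d\mu = 0$.
Let $g = 0$ $\mu$-a.e. be a non-negative function. Let $q \in S(g)$, so that $q(\omega) \le g(\omega)$ and $q(\omega) \ge 0$, $\forall q \in S(g)$ and $\forall \omega$.


Since $g = 0$ $\mu$-a.e., we have $q = 0$ $\mu$-a.e. for every $q \in S(g)$.
Hence, by the simple-function case, $\int q\,d\mu = 0$, $\forall q \in S(g) \implies \int g\,d\mu = 0$.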


[PAI 4] [Linearity] $\int (g + h)\,d\mu = \int g\,d\mu + \int h\,d\mu$. And, $E(X + Y) = E(X) + E(Y)$.

Proof: Let $g$ and $h$ be simple functions. We can write $g$ and $h$ in canonical representation form as:

$$g = \sum_{i=1}^{k} a_i I_{A_i}, \qquad h = \sum_{j=1}^{m} b_j I_{B_j},$$

where the sets $A_i$ are disjoint, and the sets $B_j$ are also disjoint. So, the sets $A_i \cap B_j$ are disjoint. Then, $g + h$ can be written as:

$$g + h = \sum_{i=1}^{k} \sum_{j=1}^{m} (a_i + b_j)\, I_{A_i \cap B_j}.$$

So,

$$\begin{aligned}
\int (g + h)\,d\mu &\overset{(a)}{=} \sum_{i=1}^{k} \sum_{j=1}^{m} (a_i + b_j)\, \mu(A_i \cap B_j), \\
&= \sum_{i=1}^{k} a_i \sum_{j=1}^{m} \mu(A_i \cap B_j) + \sum_{j=1}^{m} b_j \sum_{i=1}^{k} \mu(A_i \cap B_j), \\
&\overset{(b)}{=} \sum_{i=1}^{k} a_i\, \mu(A_i) + \sum_{j=1}^{m} b_j\, \mu(B_j), \\
&\overset{(c)}{=} \int g\,d\mu + \int h\,d\mu,
\end{aligned}$$

where (a) and (c) are due to the definition of the integral of simple functions, and (b) is due to the finite additivity of $\mu$.

Proving linearity for general non-negative functions is not easy at this point. We will return to finish this proof after equipping ourselves with the Monotone Convergence Theorem.


[PAI 5] If $0 \le g \le h$ $\mu$-a.e., then $\int g\,d\mu \le \int h\,d\mu$. In particular, if $0 \le X \le Y$ a.s., then $E(X) \le E(Y)$.
Proof: Let $g$ and $h$ be simple functions and $0 \le g \le h$ $\mu$-a.e. Then, we have $h = g + q$, for some simple function $q \ge 0$ $\mu$-a.e. But, we can write $q = q_+ - q_-$, where $q_+ \ge 0$ and $q_- \ge 0$, and $q_- = 0$ $\mu$-a.e. Here, $q$, $q_+$, $q_-$ are all simple functions. Using the linearity property [PAI 4] and then properties [PAI 3], [PAI 2], we write

$$\int h\,d\mu = \int g\,d\mu + \int q_+\,d\mu - \int q_-\,d\mu = \int g\,d\mu + \int q_+\,d\mu \ge \int g\,d\mu.$$

Let $g$ and $h$ be non-negative functions and $0 \le g \le h$ $\mu$-a.e. Let $q \in S(g) \implies q(\omega) \le g(\omega) \le h(\omega) \ \forall \omega \in \Omega \implies q \in S(h)$. So, $S(g) \subseteq S(h)$. Hence,

$$\sup_{q \in S(g)} \int q\,d\mu \le \sup_{q \in S(h)} \int q\,d\mu \implies \int g\,d\mu \le \int h\,d\mu.$$


[PAI 6] If $g = h$ $\mu$-a.e., then $\int g\,d\mu = \int h\,d\mu$. If $X = Y$ a.s., then $E(X) = E(Y)$.

Proof: The proof follows from the above property, since $g = h$ $\mu$-a.e. $\iff$ $g \le h$ $\mu$-a.e. and $h \le g$ $\mu$-a.e. ∎

Example: The Dirichlet function and the zero function are equal almost everywhere with respect to the Lebesgue measure. So, they have the same integral, equal to zero, under the Lebesgue measure.


Note that the measure with respect to which the integration is performed on both sides must be the same for the equality to hold.


[PAI 7] If $g \ge 0$ $\mu$-a.e., and $\int g\,d\mu = 0$, then $g = 0$ $\mu$-a.e. If $X \ge 0$ a.s., and $E(X) = 0$, then $X = 0$ a.s.


Proof: Let $g$ be a simple function. Let $g \ge 0$ $\mu$-a.e., and $\int g\,d\mu = 0$. We can write $g = g_+ - g_-$, where $g_+ \ge 0$ and $g_- \ge 0$. Then, $g_- = 0$ $\mu$-a.e. Using [PAI 3], we get $\int g_-\,d\mu = 0$.
Due to the linearity property [PAI 4], we can write $\int g_+\,d\mu = \int g\,d\mu + \int g_-\,d\mu = 0$.

Observe that $g_+$ is a simple function. So, $g_+$ has a canonical representation of the form $g_+ = \sum_{i=1}^{k} a_i I_{A_i}$, with $a_i > 0$ for each $i$. It follows that $\mu(A_i) = 0 \ \forall i$, since $\sum_{i=1}^{k} a_i\, \mu(A_i) = 0$. Due to finite additivity, we conclude that $\mu\big(\bigcup_{i=1}^{k} A_i\big) = 0$. Therefore, $g_+ = 0$ $\mu$-a.e., and $g = 0$ $\mu$-a.e.


Let $g$ be a non-negative function. We prove this property by contradiction. Suppose the contrary, i.e., let $B = \{\omega \mid g(\omega) > 0\}$, where $\mu(B) > 0$.

Let $B_n = \{\omega \mid g(\omega) > \frac{1}{n}\}$. Clearly, $B_n \subseteq B_{n+1} \ \forall n \in \mathbb{N}$ and $\bigcup_{n=1}^{\infty} B_n = B$. So, $\mu(B) = \mu\big(\bigcup_{n=1}^{\infty} B_n\big) = \lim_{n \to \infty} \mu(B_n) > 0$, which implies that $\exists\, k \in \mathbb{N}$ such that $\mu(B_k) > 0$ (by properties of limits of sequences).
So,

$$\int g\,d\mu \ge \int_B g\,d\mu = \int g I_B\,d\mu \ge \int g I_{B_k}\,d\mu \ge \frac{1}{k} \int I_{B_k}\,d\mu = \frac{1}{k}\,\mu(B_k) > 0,$$

which is a contradiction! ∎


[PAI 8] [Scaling] Let $a \ge 0$. Then $\int (af)\,d\mu = a \int f\,d\mu$. If $a \ge 0$, then $E(aX) = aE(X)$.

Proof: Let $f$ be a simple function. It is trivially true in this case (Why?). We should be careful for the case where $\int f\,d\mu = \infty$ and $a = 0$. We see that $af = 0 \implies \int (af)\,d\mu = 0 = 0 \times \infty$ (by convention for extended reals!) $= a \int f\,d\mu$, so the property holds.


Let $f$ be a non-negative function. If $a = 0$, then the result is obvious. So, consider the case $a > 0$. It can be easily seen that $q \in S(f) \iff aq \in S(af)$. Hence,

$$\int (af)\,d\mu = \sup_{q' \in S(af)} \int q'\,d\mu = \sup_{aq \in S(af)} \int (aq)\,d\mu = \sup_{q \in S(f)} \int (aq)\,d\mu = \sup_{q \in S(f)} a \int q\,d\mu = a \sup_{q \in S(f)} \int q\,d\mu = a \int f\,d\mu.$$

∎
With the help of point (3) in section 17.2.2 from Lecture #17, proving the above properties for general 
measurable functions is not difficult, and is left as an exercise to the reader. 

As mentioned in a previous lecture, we now prove the inclusion-exclusion property of probability measures using indicator random variables and their expectation.

Inclusion-Exclusion property of probability measures:
Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $A_1, A_2, \ldots, A_n$ be elements of $\mathcal{F}$. Then,

$$P\Big(\bigcup_{i=1}^{n} A_i\Big) = \sum_{i=1}^{n} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \cdots + (-1)^{n+1} P\Big(\bigcap_{i=1}^{n} A_i\Big).$$

Proof:

$$\begin{aligned}
I_{\bigcup_{i=1}^{n} A_i} &= 1 - I_{\bigcap_{i=1}^{n} A_i^c}, \\
&= 1 - \prod_{i=1}^{n} I_{A_i^c}, \\
&= 1 - \prod_{i=1}^{n} (1 - I_{A_i}), \\
&= \sum_{i=1}^{n} I_{A_i} - \sum_{i<j} I_{A_i} I_{A_j} + \cdots + (-1)^{n+1} \prod_{i=1}^{n} I_{A_i}.
\end{aligned}$$

Taking expectation on both sides of the above equation yields the desired result, since $I_{A_i} I_{A_j} = I_{A_i \cap A_j}$.
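The inclusion-exclusion formula can be checked by brute force on a small finite probability space. The sketch below assumes a uniform measure on $\{1, \ldots, 30\}$ and three events of our own choosing (multiples of 2, 3 and 5); the helper names `prob` and `inclusion_exclusion` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def prob(event, omega):
    """Uniform probability of `event` (a set) on the finite sample space `omega`."""
    return Fraction(len(event & omega), len(omega))


def inclusion_exclusion(events, omega):
    """Right-hand side of the inclusion-exclusion formula."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = (-1) ** (r + 1)               # + for odd r, - for even r
        for subset in combinations(events, r):
            total += sign * prob(set.intersection(*subset), omega)
    return total


omega = set(range(1, 31))
events = [
    {w for w in omega if w % 2 == 0},
    {w for w in omega if w % 3 == 0},
    {w for w in omega if w % 5 == 0},
]
```

Comparing `inclusion_exclusion(events, omega)` with `prob(set.union(*events), omega)` gives the same exact fraction, since `Fraction` arithmetic avoids rounding.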


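Now, we summarize all the properties here:

Property   Generic measure                                                            Probability measure
PAI 1      $\int I_A\,d\mu = \mu(A)$                                                  $E[I_A] = P(A)$
PAI 2      $g \ge 0 \implies \int g\,d\mu \ge 0$                                      $X \ge 0 \implies E(X) \ge 0$
PAI 3      $g = 0$ $\mu$-a.e. $\implies \int g\,d\mu = 0$                             $X = 0$ a.s. $\implies E(X) = 0$
PAI 4      $\int (g+h)\,d\mu = \int g\,d\mu + \int h\,d\mu$                           $E(X+Y) = E(X) + E(Y)$
PAI 5      $0 \le g \le h$ $\mu$-a.e. $\implies \int g\,d\mu \le \int h\,d\mu$        $0 \le X \le Y$ a.s. $\implies E(X) \le E(Y)$
PAI 6      $g = h$ $\mu$-a.e. $\implies \int g\,d\mu = \int h\,d\mu$                  $X = Y$ a.s. $\implies E(X) = E(Y)$
PAI 7      $g \ge 0$ $\mu$-a.e. and $\int g\,d\mu = 0 \implies g = 0$ $\mu$-a.e.      $X \ge 0$ a.s. and $E(X) = 0 \implies X = 0$ a.s.
PAI 8      $a \ge 0 \implies \int (af)\,d\mu = a \int f\,d\mu$                        $a \ge 0 \implies E(aX) = aE(X)$


18.2 Exercise:
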


1. Show that if $g : \Omega \to [0, \infty]$ satisfies $\int g\,d\mu < \infty$, then $g < \infty$ $\mu$-a.e.


2. Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $g : \Omega \to \mathbb{R}$ be a non-negative measurable function. Let $\lambda$ be the Lebesgue measure. Let $f$ be a non-negative measurable function on the real line such that $\int f\,d\lambda = 1$. For any Borel set $A$, if $P_1(A) = \int_A f\,d\lambda$, then prove that $P_1$ is a probability measure.


3. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables for which $E[X_1^{-1}]$ exists. Show that if $m \le n$, then

$$E\left[\frac{S_m}{S_n}\right] = \frac{m}{n}, \quad \text{where } S_m = X_1 + X_2 + \cdots + X_m.$$


4. Consider the real line endowed with the Borel $\sigma$-algebra, and let $c \in \mathbb{R}$ be fixed. Then the Dirac measure at $c$, denoted as $\delta_c$, is defined on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ as follows. For any Borel set $A$, $\delta_c(A) = 1$ if $c \in A$, and $\delta_c(A) = 0$ if $c \notin A$. It is quite easy to see that $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \delta_c)$ is a measure space (indeed, it is a probability space). The Dirac measure is referred to as the unit impulse in the engineering literature, and is sometimes (incorrectly) called a Dirac delta “function”.

(a) Let $g$ be a non-negative, measurable function. Show that $\int g\,d\delta_c = g(c)$.

Now, let us define a counting measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ as $\mu(A) = \sum_{n=1}^{\infty} \delta_n(A)$. In words, $\mu(A)$ simply counts the number of natural numbers contained in the Borel set $A$. In engineering parlance, the counting measure is called an impulse train.

(b) Let $g$ be a non-negative, measurable function. Show that $\int g\,d\mu = \sum_{n=1}^{\infty} g(n)$. Thus, summation is just a special case of integration. In particular, summation is nothing but integration with respect to the counting measure!


Lecture 19: Monotone Convergence Theorem
Lecturer: Dr. Krishna Jagannathan Scribes: Vishakh Hegde 


In this lecture, we present the Monotone Convergence Theorem (henceforth called MCT), which is considered one of the cornerstones of integration theory. The MCT gives us a sufficient condition for interchanging limit and integral. We also prove the linearity property of integrals using the MCT. Recall that $g_n \to g$ $\mu$-a.e. if $g_n(\omega) \to g(\omega) \ \forall \omega \in \Omega$, except possibly on a set of $\mu$-measure zero.


19.1 Monotone Convergence Theorem 


Theorem 19.1 Let $g_n \ge 0$ be a sequence of measurable functions such that $g_n \uparrow g$ $\mu$-a.e. (That is, except perhaps on a set of $\mu$-measure zero, we have $g_n(\omega) \to g(\omega)$, and $g_n(\omega) \le g_{n+1}(\omega)$, $n \ge 1$.) We then have $\int g_n\,d\mu \uparrow \int g\,d\mu$. In other words,

$$\lim_{n \to \infty} \int g_n\,d\mu = \int g\,d\mu.$$


See Section 5.2 in Lecture 11 of [1] for the proof.


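Example 19.2 Consider $([0,1], \mathcal{B}, \lambda)$ and consider the sequence of functions given by

$$f_n(\omega) = \begin{cases} n, & \text{if } 0 \le \omega \le 1/n, \\ 0, & \text{otherwise.} \end{cases}$$

Each $f_n$ is a simple function, so

$$\int f_n\,d\lambda = 1 \ \ \forall n \implies \lim_{n \to \infty} \int f_n\,d\lambda = 1.$$

For $\omega > 0$, we have

$$\lim_{n \to \infty} f_n(\omega) = 0.$$

For $\omega = 0$, we have

$$\lim_{n \to \infty} f_n(\omega) = \infty.$$

Therefore, denoting the pointwise limit by $f$ (which is zero $\lambda$-a.e.), we have

$$\int f\,d\lambda = 0.$$

Hence we see that

$$\int f\,d\lambda \ne \lim_{n \to \infty} \int f_n\,d\lambda.$$

Note that monotonicity does not hold in this example.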
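The failure of the limit-integral interchange in this example is easy to see with exact arithmetic. The sketch below assumes the sequence is $f_n = n$ on $[0, 1/n]$ and 0 elsewhere, and uses the hypothetical helper names `f_n` and `integral_f_n`; the integral is evaluated through the simple-function formula (value times length of the interval).

```python
from fractions import Fraction


def f_n(n, w):
    """f_n(w) = n on [0, 1/n], and 0 elsewhere on [0, 1]."""
    return n if 0 <= w <= Fraction(1, n) else 0


def integral_f_n(n):
    """Lebesgue integral of the simple function f_n: value n times length 1/n."""
    return n * Fraction(1, n)
```

Every `integral_f_n(n)` equals 1, while for any fixed `w > 0` the value `f_n(n, w)` is 0 once `n > 1/w`. The sequence is also not monotone: at `w = 3/4`, `f_n(1, w)` is 1 but `f_n(2, w)` is 0, which is why the MCT does not apply.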
19.2 Linearity of Integrals


In this section, we will prove the linearity property of integrals, using the MCT. Recall that we stated the 
linearity property in the previous lecture as PAI 4 but proved it only for simple functions. Here we prove 
it in full generality. 


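Let $f$ and $g$ be simple functions. Therefore, we can express them as

$$f = \sum_{i=1}^{k} a_i I_{A_i}, \qquad g = \sum_{j=1}^{m} b_j I_{B_j}.$$

Here $A_i$ and $B_j$ are $\mathcal{F}$-measurable sets, and $I_{A_i}$ and $I_{B_j}$ are indicator variables. Summing $f$ and $g$, we obtain

$$f + g = \sum_{i=1}^{k} \sum_{j=1}^{m} (a_i + b_j)\, I_{A_i \cap B_j}. \tag{19.1}$$

Note that $f$ and $g$ are in canonical representation. This implies that the $A_i$'s are disjoint sets, and so are the $B_j$'s. Therefore, the sets $A_i \cap B_j$ are disjoint. Hence we have

$$\begin{aligned}
\int (f + g)\,d\mu &= \sum_{i=1}^{k} \sum_{j=1}^{m} (a_i + b_j)\, \mu(A_i \cap B_j), \\
&= \sum_{i=1}^{k} a_i \sum_{j=1}^{m} \mu(A_i \cap B_j) + \sum_{j=1}^{m} b_j \sum_{i=1}^{k} \mu(A_i \cap B_j).
\end{aligned}$$

By the finite additivity property, we have

$$\int (f + g)\,d\mu = \sum_{i=1}^{k} a_i\, \mu(A_i) + \sum_{j=1}^{m} b_j\, \mu(B_j) = \int f\,d\mu + \int g\,d\mu.$$


Next, we need to prove linearity for non-negative measurable functions. Let $f_n$ and $g_n$ (with $n \ge 1$) be sequences of simple functions where $f_n \uparrow f$ and $g_n \uparrow g$. Such a sequence of simple functions always exists for every non-negative measurable function, as we will show in the next section. Now, since $f_n$ and $g_n$ are monotonic, $f_n + g_n$ is monotonic. Then we can show that $(f_n + g_n) \uparrow (f + g)$. Using the MCT, we have

$$\int (f + g)\,d\mu = \lim_{n \to \infty} \int (f_n + g_n)\,d\mu. \tag{19.2}$$

But $f_n$ and $g_n$ are simple functions. We know that, for simple functions,

$$\int (f_n + g_n)\,d\mu = \int f_n\,d\mu + \int g_n\,d\mu.$$

Thus,

$$\lim_{n \to \infty} \int (f_n + g_n)\,d\mu = \lim_{n \to \infty} \int f_n\,d\mu + \lim_{n \to \infty} \int g_n\,d\mu \overset{\text{MCT}}{=} \int f\,d\mu + \int g\,d\mu.$$

This implies that

$$\int (f + g)\,d\mu = \int f\,d\mu + \int g\,d\mu. \tag{19.3}$$

This proves linearity for non-negative functions.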


For arbitrary measurable functions $f$ and $g$, we can write them as $f = f_+ - f_-$ and $g = g_+ - g_-$, where $f_+$, $f_-$, $g_+$ and $g_-$ are non-negative measurable functions. A similar proof can then be worked out, which completes the proof of linearity.


19.3 Integration using simple functions 


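Our earlier definition $\int g\,d\mu = \sup_{q \in S(g)} \int q\,d\mu$ helped us to prove some properties of abstract integrals quite easily. However, it does not give us a practical way of performing the integration. In this section, we present a method to explicitly compute the integral, using the MCT. First, we approximate the function to be integrated using simple functions from below. Specifically, define

$$g_n(\omega) = \begin{cases} n, & \text{if } g(\omega) \ge n, \\ \dfrac{i}{2^n}, & \text{if } \dfrac{i}{2^n} \le g(\omega) < \dfrac{i+1}{2^n}, \ i \in \{0, 1, \ldots, n2^n - 1\}. \end{cases} \tag{19.4}$$

Thus, the function to be integrated is quantized to $n2^n$ levels. Next, we note here that $g_n(\omega)$ is a simple function, since it can be written as

$$g_n(\omega) = \sum_{i=0}^{n2^n - 1} \frac{i}{2^n}\, I_{\left\{\frac{i}{2^n} \le g(\omega) < \frac{i+1}{2^n}\right\}} + n\, I_{\{g(\omega) \ge n\}}. \tag{19.5}$$


Claim 1: We can easily show that:

• $g_n(\omega) \to g(\omega) \ \forall \omega \in \Omega$.

• $g_n(\omega) \le g_{n+1}(\omega) \ \forall \omega \in \Omega$ and $\forall n \in \mathbb{N}$.

Therefore, using the MCT, we have

$$\int g\,d\mu = \lim_{n \to \infty} \int g_n\,d\mu = \lim_{n \to \infty} \left[ \sum_{i=0}^{n2^n - 1} \frac{i}{2^n}\, \mu\Big(\frac{i}{2^n} \le g < \frac{i+1}{2^n}\Big) + n\, \mu(g \ge n) \right].$$

Now, if $g$ is bounded, the second term $n\,\mu(g \ge n)$ will be 0 for all sufficiently large $n$, and if $g$ is unbounded, it may or may not be finite.


This gives us an explicit way to compute the abstract integral.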
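The quantization scheme can be sketched numerically. The example below is an illustration under assumptions of our own: it takes $g(\omega) = \omega^2$ on $[0,1]$ with the Lebesgue measure, for which each level set is an interval whose length is known in closed form. The helper names `quantize` and `integral_of_quantized_square` are hypothetical.

```python
import math


def quantize(value, n):
    """g_n(w) from (19.4), given value = g(w) >= 0."""
    if value >= n:
        return float(n)
    return math.floor(value * 2**n) / 2**n


def integral_of_quantized_square(n):
    """Integral of g_n over [0, 1] w.r.t. Lebesgue measure for g(w) = w**2.

    The level set {i/2^n <= g < (i+1)/2^n} is the interval
    [sqrt(i/2^n), sqrt((i+1)/2^n)), intersected with [0, 1].
    """
    levels = 2**n
    total = 0.0
    for i in range(n * levels):
        left = min(math.sqrt(i / levels), 1.0)
        right = min(math.sqrt((i + 1) / levels), 1.0)
        total += (i / levels) * (right - left)
    # The set {g >= n} is either empty or the single point w = 1, so it has
    # Lebesgue measure zero and the term n * mu(g >= n) contributes nothing.
    return total
```

As $n$ grows, the values returned by `integral_of_quantized_square(n)` increase towards $1/3$, the integral of $\omega^2$ over $[0,1]$, which illustrates the monotone convergence used above.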


19.4 Exercise: 


1. Prove Claim 1.

2. Let $X$ be a non-negative random variable (not necessarily discrete or continuous) with $E[X] < \infty$.


(a) Prove that $\lim_{n \to \infty} n P(X > n) = 0$. [Hint: Write $E[X] = E[X I_{\{X \le n\}}] + E[X I_{\{X > n\}}]$.]


(b) Prove that $E[X] = \int_0^{\infty} P(X > x)\,dx$. Yes, the integral on the right is just a plain old Riemann integral! [Hint: Write out $E[X] = \int x\,dP_X$ as the limit of a sum, and use part (a) for the last term.]


We say a random variable $X$ is stochastically larger than a random variable $Y$, and denote this by $X \ge_{st} Y$, if $P(X > a) \ge P(Y > a) \ \forall a \in \mathbb{R}$.


(c) For non-negative random variables $X$ and $Y$, show that if $X \ge_{st} Y$, then $E[X] \ge E[Y]$.


3. Show that $f(x) = x^{-\alpha}$ is integrable on $[1, \infty)$ for $\alpha > 1$.


19.5 References: 


[1] DAVID GAMARNICK AND JOHN TSITSIKLIS, “Introduction to Probability”, MIT OCW, 2008.


Lecture 20: Expectation of Discrete RVs, Expectation over Different Spaces
Lecturer: Dr. Krishna Jagannathan Scribe: Arjun Bhagoji 


20.1 Expectations of Discrete RVs 


A discrete random variable $X(\omega)$ (which takes only a countable set of values) can be represented as follows:

Definition 20.1 $X(\omega) = \sum_{i=1}^{\infty} a_i I_{A_i}(\omega)$, where $X \ge 0$.


In the canonical representation, the $a_i$'s are non-negative and distinct, and the $A_i$'s are disjoint. It is easy to see that the $A_i$'s partition the sample space. Let us now define a sequence of simple random variables, which approximate $X$ from below.


Definition 20.2 Define $X_n(\omega) = \sum_{i=1}^{n} a_i I_{A_i}(\omega)$.


Figure 20.1: Simple random variable


Note that $\forall \omega$, $X_n(\omega) \le X_{n+1}(\omega)$, where $n \ge 1$. Next, let us fix $\omega \in \Omega$. Since the $A_i$'s partition $\Omega$, there exists $k \ge 1$ such that $\omega \in A_k$. Thus, $\forall n \ge k$, $X_n(\omega) = a_k$ and $\forall n < k$, $X_n(\omega) = 0$. Therefore,

$$\lim_{n \to \infty} X_n(\omega) = X(\omega) \quad \forall \omega \in \Omega. \tag{20.1}$$

In other words, $X_n(\omega)$ is a sequence of simple functions converging monotonically to $X(\omega)$. Now, applying the Monotone Convergence Theorem (MCT) to the sequence of random variables $X_n$,

$$\begin{aligned}
E[X] &= \lim_{n \to \infty} E[X_n], \\
&= \lim_{n \to \infty} \sum_{i=1}^{n} a_i P(A_i), \\
&= \lim_{n \to \infty} \sum_{i=1}^{n} a_i P(X = a_i), \\
\implies E[X] &= \sum_{i=1}^{\infty} a_i P(X = a_i).
\end{aligned} \tag{20.2}$$

The limit of the sum is well-defined, as $X$ is a non-negative random variable, and it either converges to some non-negative real number or goes to $+\infty$. If $X$ is discrete but takes on both positive and negative values, we write $X = X_+ - X_-$, where $X_+ = \max(X, 0)$ and $X_- = -\min(X, 0)$. Then, we compute

$$E[X] = E[X_+] - E[X_-]. \tag{20.3}$$

The above is meaningful when at least one of the expectations on the right hand side is finite. We now give some examples.


1. $X \sim \text{Geometric}(p)$: $E[X] = \sum_{i=1}^{\infty} i (1-p)^{i-1} p = \frac{1}{p}$.

This tells us that, for a geometric random variable, the expected number of trials for the first success to occur scales as $\frac{1}{p}$.


2. $P(X = k) = \frac{6}{\pi^2 k^2}$ for $k \ge 1$: For this probability distribution, the expectation is calculated as

$$E[X] = \sum_{k=1}^{\infty} k \left( \frac{6}{\pi^2 k^2} \right) = +\infty. \tag{20.4}$$

In this example, we see that a random variable can have infinite expectation.


3. $P(X = k) = \frac{3}{\pi^2 k^2}$ for $k \in \mathbb{Z} \setminus \{0\}$: For this probability distribution, the expectation is calculated as $E[X] = E[X_+] - E[X_-]$. However, both the expectations $E[X_+]$ and $E[X_-]$ are infinite! Therefore, $E[X]$ is not defined! This is an example of a discrete random variable with undefined expectation.
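The contrast between the first two examples can be seen from partial sums. The sketch below assumes the second PMF is $P(X = k) = 6/(\pi^2 k^2)$, the normalisation forced by $\sum_k 1/k^2 = \pi^2/6$; the helper names are hypothetical. Partial sums only suggest convergence or divergence and are not a proof.

```python
import math


def geometric_partial_mean(p, n):
    """Partial sum of sum_i i * (1 - p)**(i - 1) * p up to i = n."""
    return sum(i * (1 - p) ** (i - 1) * p for i in range(1, n + 1))


def heavy_tail_partial_mean(n):
    """Partial sum of sum_k k * 6 / (pi**2 * k**2) up to k = n."""
    return sum(6.0 / (math.pi**2 * k) for k in range(1, n + 1))
```

The geometric partial sums settle near $1/p$, whereas the second sequence keeps growing roughly like $(6/\pi^2)\ln n$, consistent with an infinite expectation.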


20.2 Connection between Riemann and Lebesgue integrals 
The connection is given by the following theorem which we state without proof. 


Theorem 20.3 Let $f$ be measurable and Riemann integrable over an interval $[a,b]$. Then, $\int_{[a,b]} f\,d\lambda$ exists, and

$$\int_{[a,b]} f\,d\lambda = \int_a^b f(x)\,dx. \tag{20.5}$$

Here, $\lambda$ is the Lebesgue measure on $\mathbb{R}$. The integral on the left is a Lebesgue integral, while the one on the right is the standard Riemann integral.


20.3 Expectations on different spaces


We often want to compute the expectation of a function of a random variable, say Y = f(X), where both 
X and Y are random variables and f(-) is a measurable function on R. The following theorem asserts that 
the expectation can be computed over different spaces, to obtain the ‘same answer.’ For example, we can 
compute the expectation of Y by either working in the X-space or the Y-space to write (for discrete random 
variables) 


$$\sum_i y_i P(Y = y_i) = \sum_i f(a_i) P(X = a_i), \tag{20.6}$$

where $y_i = f(a_i)$. This is just a special case of the following theorem.


Theorem 20.4 Denote the probability measure on the sample space by $P$, on the range space of $X$ by $P_X$, and on the range space of $Y$ by $P_Y$. Then, $\int Y\,dP = \int f\,dP_X = \int y\,dP_Y$, where $Y = f(X)$ and the integrals are over the respective spaces.


Figure 20.2: Different spaces considered 


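Proof: Let $f$ be a simple function which takes values $y_1, y_2, \ldots, y_n$. Then,

$$\begin{aligned}
\int Y\,dP &= \sum_{i=1}^{n} y_i P(\omega \mid Y(\omega) = y_i), \\
&= \sum_{i=1}^{n} y_i P_Y(y_i) = \int y\,dP_Y.
\end{aligned}$$

Now, looking at the second integral, we have

$$\begin{aligned}
\int f\,dP_X &= \sum_{i=1}^{n} y_i P_X(x \in \mathbb{R} \mid f(x) = y_i), \\
&= \sum_{i=1}^{n} y_i P_X(f^{-1}(y_i)), \\
&= \sum_{i=1}^{n} y_i P(\omega : X(\omega) \in f^{-1}(y_i)), \\
&= \sum_{i=1}^{n} y_i P(\omega \mid Y(\omega) = y_i).
\end{aligned}$$

Now, we extend the above to the case when $f$ is a non-negative measurable function. Let $\{f_n\}$ be a sequence of simple functions such that $f_n \uparrow f$, according to the construction given in the previous lecture. Thus, $f_n(X) \uparrow f(X)$ and

$$\begin{aligned}
\int Y\,dP &= \int f(X)\,dP, \\
&= \lim_{n \to \infty} \int f_n(X)\,dP \quad \text{(by MCT)}, \\
&= \lim_{n \to \infty} \int f_n\,dP_X \quad \text{(by the simple-function case)}, \\
&= \int f\,dP_X \quad \text{(by MCT)}.
\end{aligned}$$

This can now be simply extended to the case where $f$ takes both negative and positive values. ∎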


A simple corollary of this theorem is that $\int X\,dP = \int x\,dP_X$.
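
The discrete identity (20.6) can be checked on a small example. The sketch below is only an illustration: the PMF of X and the function f(x) = x² are hypothetical choices, not taken from the lecture. It computes E[f(X)] once in the X-space and once in the Y-space, after pushing the PMF of X forward through f.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical PMF of X (illustrative values, not from the notes).
p_X = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}


def f(x):
    """Hypothetical measurable function; Y = f(X)."""
    return x * x


# X-space: sum of f(x) P(X = x).
e_x_space = sum(f(x) * p for x, p in p_X.items())

# Y-space: build the PMF of Y = f(X), then sum y P(Y = y).
p_Y = defaultdict(Fraction)
for x, p in p_X.items():
    p_Y[f(x)] += p
e_y_space = sum(y * p for y, p in p_Y.items())

print(e_x_space, e_y_space)
```

Both sums are computed with exact rational arithmetic, so they can be compared with `==` rather than with a floating-point tolerance.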


20.4 Exercise 


1. [Dimitri P. Bertsekas] Let X be a random variable with PMF $p_X(x) = x^2/a$ if x = −3, −2, −1, 0, 1, 2, 3
and zero otherwise. Compute a and E[X].


2. [Dimitri P.Bertsekas] As an advertising campaign, a chocolate factory places golden tickets in some 
of its candy bars, with the promise that a golden ticket is worth a trip through the chocolate factory, 
and all the chocolate you can eat for life. If the probability of finding a golden ticket is p, find the 
expected number of bars you need to eat to find a ticket. 


3. [Dimitri P.Bertsekas] On a given day, your golf score takes values from the range 101 to 110, with 
probability 0.1, independent of other days. Determined to improve your score, you decide to play on 
three different days and declare as your score the minimum X of the scores X1, X2 and X3 on the 
different days. By how much has your expected score improved as a result of playing on three days? 


4. [Papoulis] A biased coin is tossed and the first outcome is noted. Let the probability of a head occurring
be p and that of a tail be q = 1 − p. The tossing is continued until the outcome is the complement of
the first outcome, thus completing the first run. Let X denote the length of the first run. Find the
PMF of X and show that $E[X] = \frac{p}{q} + \frac{q}{p}$.


Lecture 21: Expectation of CRVs, Fatou’s Lemma and DCT


Lecturer: Krishna Jagannathan Scribe: Jainam Doshi 


In the present lecture, we will cover the following three topics: 


• Integration of Continuous Random Variables
• Fatou’s Lemma
• Dominated Convergence Theorem (DCT)


21.1 Integration of Continuous Random Variables 


Theorem 21.1 Consider a probability space $(\Omega, \mathcal{F}, P)$. Let $X : \Omega \to \mathbb{R}$ be a continuous random variable.
Let g be a measurable function which is either non-negative or satisfies $\int |g|\,dP_X < \infty$. Then,

\[ E[g(X)] = \int g f_X\,d\lambda. \]

In particular, if g(x) = x, i.e. the identity map, we have

\[ E[X] = \int x f_X\,d\lambda. \]


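Proof: Let us first consider the case of g being a simple function, i.e. $g = \sum_{i=1}^{K} a_i I_{A_i}$ for some measurable
disjoint subsets $A_i$ of the real line. We then have

\[
\begin{aligned}
E[g(X)] &= \int g\,dP_X \\
&= \sum_{i=1}^{K} a_i P_X(A_i) && \text{[g is a simple function]} \\
&= \sum_{i=1}^{K} a_i \int_{A_i} f_X\,d\lambda && \text{[From Radon-Nikodym Theorem]} \\
&= \sum_{i=1}^{K} \int_{A_i} a_i f_X\,d\lambda && \text{[$a_i$ is a constant]} \\
&= \sum_{i=1}^{K} \int_{\mathbb{R}} (a_i I_{A_i} f_X)\,d\lambda && \text{[$I_{A_i}$ is the indicator of the set $A_i$]} \\
&= \int_{\mathbb{R}} \sum_{i=1}^{K} (a_i I_{A_i} f_X)\,d\lambda && \text{[Interchanging finite summation and integral]} \\
&= \int g f_X\,d\lambda.
\end{aligned}
\]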

Thus we have proved the above theorem for simple functions. We now assume g to be a non-negative
measurable function which may not necessarily be simple.


Let $g_n$ be an increasing sequence of non-negative simple functions that converge to g pointwise. One way
of coming up with such a sequence was discussed in the previous lecture. We then have,

\[
\begin{aligned}
E[g(X)] &= \int g\,dP_X \\
&= \lim_{n\to\infty} \int g_n\,dP_X && \text{[From MCT]} \\
&= \lim_{n\to\infty} \int g_n f_X\,d\lambda && \text{[From result for simple functions]} \\
&= \int g f_X\,d\lambda && \text{[From MCT, since $g_n f_X \uparrow g f_X$]}
\end{aligned}
\]

For arbitrary g which are absolutely integrable, a similar proof can be worked out by writing $g = g_+ - g_-$
and proceeding. ∎


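Example 1: Let X be an exponential random variable with parameter $\mu$. Find E[X] and E[X²].
Solution: Recall that for an exponential random variable with parameter $\mu$, $f_X(x) = \mu e^{-\mu x}$ for x ≥ 0. Thus, we have

\[ E[X] = \int x f_X\,d\lambda = \int_0^{\infty} x \mu e^{-\mu x}\,dx = \frac{1}{\mu}, \]

\[ E[X^2] = \int x^2 f_X\,d\lambda = \int_0^{\infty} x^2 \mu e^{-\mu x}\,dx = \frac{2}{\mu^2}. \]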


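Example 2: Let $X \sim N(\mu, \sigma^2)$. Find E[X] and E[X²].
Solution: Recall that the density of X is given by $f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. Thus, we have

\[ E[X] = \int x f_X\,d\lambda = \int_{-\infty}^{\infty} x \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx = \mu, \]

\[ E[X^2] = \int x^2 f_X\,d\lambda = \int_{-\infty}^{\infty} x^2 \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx = \mu^2 + \sigma^2. \]


Example 3: Let X be a one-sided Cauchy random variable, i.e. $f_X(x) = \frac{2}{\pi}\frac{1}{1+x^2}$ for x ≥ 0. Find E[X].
Solution: We have

\[ E[X] = \int_0^{\infty} \frac{2}{\pi} \frac{x}{1+x^2}\,dx = \infty. \]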


Example 4: Let X be a two-sided Cauchy random variable, i.e., $f_X(x) = \frac{1}{\pi}\frac{1}{1+x^2}$ for all $x \in \mathbb{R}$. Find E[X].
Solution: In this case the random variable X takes both positive and negative values. Hence, we need to
find $E[X_+]$ and $E[X_-]$ separately and then evaluate $E[X] = E[X_+] - E[X_-]$. Recall that $X_+ = \max(X, 0)$
and $X_- = -\min(X, 0)$. Thus,

\[ X_+(\omega) = 0 \text{ for } \omega \in A = \{\omega \in \Omega \mid X(\omega) < 0\}, \qquad X_+(\omega) = X(\omega) \text{ for } \omega \in A^c. \]

Similarly,

\[ X_-(\omega) = 0 \text{ for } \omega \in B = \{\omega \in \Omega \mid X(\omega) > 0\}, \qquad X_-(\omega) = -X(\omega) \text{ for } \omega \in B^c. \]

It is easy to see that P(A) = P(B) = 0.5. Next, we have

\[ E[X_+] = 0 \times P(A) + \int_0^{\infty} \frac{1}{\pi} \frac{x}{1+x^2}\,dx = \infty. \]

Similarly,

\[ E[X_-] = 0 \times P(B) + \int_{-\infty}^{0} \frac{1}{\pi} \frac{-x}{1+x^2}\,dx = \infty. \]

Thus, we have a case of $\infty - \infty$ and E[X] is undefined.


Note that in Example 2 also, X takes both positive and negative values and we should find $E[X_+]$ and $E[X_-]$
separately and evaluate $E[X] = E[X_+] - E[X_-]$. But in that case both $E[X_+]$ and $E[X_-]$ are finite, allowing
us to integrate with respect to the pdf $f_X(x)$ from $-\infty$ to $\infty$ directly.


Note: For the two-sided Cauchy,

\[ \int_{-\infty}^{\infty} \frac{1}{\pi} \frac{x}{1+x^2}\,dx \triangleq \lim_{\substack{M_1 \to -\infty \\ M_2 \to \infty}} \int_{M_1}^{M_2} \frac{1}{\pi} \frac{x}{1+x^2}\,dx. \]

The above limit does not exist and hence the integral is not defined.
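
To see why the limit fails to exist, note that the truncated integral has the closed form $\frac{1}{2\pi}\ln\frac{1+M_2^2}{1+M_1^2}$, so its value depends on how $M_1$ and $M_2$ grow relative to each other. The following sketch evaluates this closed form along two hypothetical paths (the function name and the chosen paths are illustrative assumptions, not part of the lecture).

```python
import math


def truncated_integral(m1, m2):
    """Closed form of the integral of x / (pi (1 + x^2)) over [m1, m2]."""
    return (math.log1p(m2 * m2) - math.log1p(m1 * m1)) / (2 * math.pi)


for m in (1e2, 1e4, 1e6):
    symmetric = truncated_integral(-m, m)      # M2 = |M1|
    lopsided = truncated_integral(-m, 2 * m)   # M2 = 2 |M1|
    print(m, symmetric, lopsided)
```

Along the symmetric path the truncated integral is identically zero, while along the path $M_2 = 2|M_1|$ it approaches $\ln(2)/\pi$. Two different limiting values along two admissible paths mean the double limit does not exist.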


21.2 Fatou’s Lemma 


Before we state Fatou’s lemma, let us motivate it with an elementary result.


Lemma 21.2 Let X and Y be random variables. Then,

\[ E[\min(X,Y)] \le \min(E[X], E[Y]), \]

\[ E[\max(X,Y)] \ge \max(E[X], E[Y]). \]

Proof: By definition, we have

\[ \min(X,Y) \le X, \qquad \min(X,Y) \le Y. \]

Taking expectations on both sides,

\[ E[\min(X,Y)] \le E[X], \qquad E[\min(X,Y)] \le E[Y]. \]

Combining the above two inequalities, we get

\[ E[\min(X,Y)] \le \min(E[X], E[Y]). \]

The other statement of the lemma, involving the maximum of X and Y, can be proved in a similar way and is left
to the reader as an exercise. ∎
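
As a quick sanity check of Lemma 21.2, the sketch below evaluates both sides of each inequality for a small joint PMF. The joint PMF is a hypothetical example chosen so that both inequalities are strict; it is not taken from the lecture.

```python
from fractions import Fraction

# Hypothetical joint PMF of (X, Y): two equally likely outcomes.
joint = {(0, 3): Fraction(1, 2), (4, 1): Fraction(1, 2)}


def expect(h):
    """E[h(X, Y)] under the joint PMF above."""
    return sum(h(x, y) * p for (x, y), p in joint.items())


e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
e_min = expect(lambda x, y: min(x, y))
e_max = expect(lambda x, y: max(x, y))

print(e_min, min(e_x, e_y))  # left side vs right side of the first inequality
print(e_max, max(e_x, e_y))  # left side vs right side of the second inequality
```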


The above lemma can be generalized to any finite collection of random variables and a similar proof can be 
worked out. Fatou’s Lemma generalizes this idea for a sequence of random variables. 


Lemma 21.3 Fatou’s Lemma: Let Y be a random variable that satisfies $E[|Y|] < \infty$. Then the following
holds,

• If $Y \le X_n$ for all n, then

\[ E\left[\liminf_{n\to\infty} X_n\right] \le \liminf_{n\to\infty} E[X_n]. \]

• If $Y \ge X_n$ for all n, then

\[ E\left[\limsup_{n\to\infty} X_n\right] \ge \limsup_{n\to\infty} E[X_n]. \]


Proof: Let us start by proving the first statement. For any n, we have

\[ \inf_{k \ge n} X_k - Y \le X_m - Y, \quad \forall m \ge n. \]

Taking expectations,

\[ E\left[\inf_{k \ge n} X_k - Y\right] \le E[X_m - Y], \quad \forall m \ge n. \]

Taking infimum with respect to m on the R.H.S, we obtain

\[ E\left[\inf_{k \ge n} X_k - Y\right] \le \inf_{m \ge n} E[X_m - Y]. \]

Let $Z_n = \inf_{k \ge n} X_k - Y$. Note that $Z_n \ge 0$ since $X_m \ge Y$ for all m, and $Z_n$ is a non-decreasing sequence of random
variables. Also, $Z = \lim_{n\to\infty} Z_n = \liminf_{n\to\infty} X_n - Y$. By MCT, $E[Z_n] \uparrow E[Z]$, so letting $n \to \infty$ in the above inequality, we have

\[ E\left[\liminf_{n\to\infty} X_n - Y\right] \le \liminf_{n\to\infty} E[X_n - Y]. \]

As $E[|Y|] < \infty$, we can invoke linearity of expectation to get the first result of Fatou’s lemma.


The second statement can be proved similarly and is left to the reader as an exercise. ∎


21.3 Dominated Convergence Theorem 


The DCT is an important result which asserts a sufficient condition under which we can interchange a limit 
and integral. 


Theorem 21.4 Consider a sequence of random variables $X_n$ that converges almost surely to X. Suppose
there exists a random variable Y such that $|X_n| \le Y$ almost surely for all n and $E[Y] < \infty$. Then, we have

\[ \lim_{n\to\infty} E[X_n] = E[X]. \]


Proof: We have $|X_n| \le Y$, which implies $-Y \le X_n \le Y$. We can now apply Fatou’s lemma to obtain

\[ E[X] = E\left[\liminf_{n\to\infty} X_n\right] \le \liminf_{n\to\infty} E[X_n] \le \limsup_{n\to\infty} E[X_n] \le E\left[\limsup_{n\to\infty} X_n\right] = E[X]. \]

Thus, all the inequalities in the above equation must be met with equalities and we have

\[ E[X] = \liminf_{n\to\infty} E[X_n] = \limsup_{n\to\infty} E[X_n], \]

which proves that the limit $\lim_{n\to\infty} E[X_n]$ exists and is given by

\[ \lim_{n\to\infty} E[X_n] = E[X]. \]
∎


Thus we see that the Dominated Convergence Theorem (DCT) is a direct consequence of Fatou’s Lemma. The
name “dominated” is intuitive because we need $|X_n|$ to be bounded by some random variable Y almost
surely for every n. However, we do not require the $X_n$’s to be monotonically increasing as in the case of MCT.
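
The domination requirement cannot simply be dropped. A standard illustration, which is not part of the lecture and is offered here only as a hedged sketch, takes U uniform on (0, 1] and compares $X_n = n\,I\{U \le 1/n\}$ with $X'_n = I\{U \le 1/n\}$. Both sequences converge to 0 almost surely, but only the second is bounded by an integrable random variable (the constant 1). The function names below are hypothetical.

```python
from fractions import Fraction


def expectation_undominated(n):
    """E[X_n] for X_n = n * 1{U <= 1/n}: value n with probability 1/n."""
    return n * Fraction(1, n)


def expectation_bounded(n):
    """E[X'_n] for X'_n = 1{U <= 1/n}: value 1 with probability 1/n."""
    return Fraction(1, n)


for n in (1, 10, 1000):
    print(n, expectation_undominated(n), expectation_bounded(n))
```

The first column of expectations stays at 1 for every n and so does not converge to E[0] = 0, whereas the second tends to 0 as the DCT predicts.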


Corollary 21.5 A special case of DCT is known as the Bounded Convergence Theorem (BCT). Here, the random
variable Y is taken to be a constant random variable. BCT states that if there exists a constant $c \in \mathbb{R}$ such
that $|X_n| \le c$ almost surely for all n, then $\lim_{n\to\infty} E[X_n] = E[X]$.


21.4 Exercise


1. [MIT OCW problem set] A workstation consists of three machines, $M_1$, $M_2$ and $M_3$, each of which
will fail after an amount of time $T_i$ which is an independent exponentially distributed random variable,
with parameter 1. Assume that the times to failure of the different machines are independent. The
workstation fails as soon as both of the following have happened:

(a) Machine $M_1$ has failed.
(b) At least one of the machines $M_2$ or $M_3$ has failed.


Find the expected value of the time to failure of the workstation.


2. [Assignment problem, University of Cambridge] Let Z be an exponential random variable with
parameter $\lambda = 1$ and $Z_{int} = \lfloor Z \rfloor$. Compute $E[Z_{int}]$.


3. [Prof. Pollak, Purdue University] Suppose $S_k$ and $S_n$ are the prices of a financial instrument on days
k and n, respectively. For k < n, the gross return $G_{k,n}$ between days k and n is defined as $G_{k,n} = S_n / S_k$
and is equal to the amount of money you would have on day n if you invested $1 on day k. Let $G_{k,k+1}$
be a lognormal random variable with parameters $\mu$ and $\sigma^2$, for all k ≥ 1, and let the random variables $G_{j,j+1}$
and $G_{k,k+1}$ be independent and identically distributed for all k ≠ j. Find the expected total gross return
from day 1 to day n.


Lecture 22: Variance and Covariance


Lecturer: Dr. Krishna Jagannathan Scribes: R.Ravi Kiran 


In this lecture, we will introduce the notions of variance and covariance of random variables. Earlier, we 
learnt about the expected value of a random variable which gives an idea of the average value. The idea of 
variance is useful in describing the extent to which the random variable deviates about its mean on either 
side. The covariance is a property that characterizes the extent of dependence between two random variables. 


22.1 Variance 


As stated earlier, the variance quantifies the extent to which the random variable deviates about the mean. 
Mathematically, the variance is defined as follows : 


Definition 22.1 Let X be a random variable with $E[X] < \infty$. The variance of X is defined as

\[ \mathrm{Var}(X) = \sigma_X^2 = E\left[(X - E[X])^2\right]. \]

$\sigma_X$ is referred to as the standard deviation of the random variable X.


22.1.1 Properties of Variance 


We will now study a few properties of the variance of a random variable. 


First and foremost, we can clearly see that for any real valued random variable X, $g(X) = (X - E[X])^2 \ge 0$.
Thus it is easy to see that $\sigma_X^2 \ge 0$ from property PAI 2 from Lecture #18. In fact, we can make the
following stronger statement regarding the variance of a random variable.


Lemma 22.2 Let X be a real valued random variable. Then, Var(X) = 0 if and only if X is a constant 
almost surely. 


Proof: We will first prove the sufficiency criterion in the above statement. That is, assume that X is
constant almost surely. Then X = E[X] almost surely, consequently
implying that $\sigma_X^2 = 0$.


To prove the necessity condition in the statement, assume that X is a random variable with zero variance.
Thus, we have the following:

\[ \sigma_X^2 = E\left[(X - E[X])^2\right] = 0 \implies \int (x - E[X])^2\,dP_X = 0. \tag{22.1} \]

Applying PAI 7 from Lecture #18 to (22.1), we can conclude that $(X - E[X])^2 = 0$ almost surely. Thus,
we have X = E[X] almost surely. ∎


Now, using some simple algebra, we make a few useful observations.

\[
\begin{aligned}
\sigma_X^2 &= E\left[(X - E[X])^2\right], \\
&= E\left[X^2 + (E[X])^2 - 2XE[X]\right], \\
&\overset{(a)}{=} E[X^2] - 2E[X] \cdot E[X] + (E[X])^2, \\
&= E[X^2] - (E[X])^2,
\end{aligned}
\tag{22.2}
\]

where (a) follows from the linearity of expectation (PAI 4 from Lecture #18). Now using the fact that
$\sigma_X^2 \ge 0$ and (22.2), we can see that $E[X^2] \ge (E[X])^2$. The term $E[X^2]$ is referred to as the second moment
of the random variable X.
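
As an illustration of (22.2), the following sketch computes the variance of a fair six-sided die both from the definition and from the second-moment formula. The die is a hypothetical example chosen for this check and does not appear in the lecture.

```python
from fractions import Fraction

# Hypothetical PMF: a fair six-sided die.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

mean = sum(k * p for k, p in pmf.items())

# Definition: E[(X - E[X])^2].
var_definition = sum((k - mean) ** 2 * p for k, p in pmf.items())

# Shortcut (22.2): E[X^2] - (E[X])^2.
second_moment = sum(k * k * p for k, p in pmf.items())
var_shortcut = second_moment - mean ** 2

print(mean, second_moment, var_definition, var_shortcut)
```

The two variance computations agree exactly, and the second moment is at least the squared mean, as (22.2) requires.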


An interesting digression: 


Theorem 22.3 (Jensen’s Inequality) Let X be any real valued random variable and let h(·) be a function
of the random variable. Then,


1. If h(·) is convex, then $E[h(X)] \ge h(E[X])$.


2. If h(·) is concave, then $E[h(X)] \le h(E[X])$.


3. If h(·) is linear, then $E[h(X)] = h(E[X])$.


A guided proof of Jensen’s inequality will be encountered in your homework.


Since $f(x) = x^2$ is a convex function, we can invoke Theorem 22.3 and observe that

\[ E[X^2] \ge (E[X])^2. \]

Let us look at a few examples.


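Example 1: Let X be a Bernoulli random variable with parameter p, i.e.,

\[ X = \begin{cases} 1 & \text{w.p. } p, \\ 0 & \text{w.p. } 1-p. \end{cases} \]

Find the variance of X.
Solution: We have

\[ E[X] = p \times 1 + (1-p) \times 0 = p. \]

Next,

\[ E[X^2] = p \times 1^2 + (1-p) \times 0^2 = p. \]

Finally,

\[ \sigma_X^2 = E[X^2] - (E[X])^2 = p - p^2 = p(1-p). \]


Example 2: Let X be a discrete valued random variable with Poisson distribution of parameter $\lambda$. That
is, $P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}$ for all $k \in \mathbb{Z}^+ \cup \{0\}$. Find the variance of X.
Solution: We have

\[ E[X] = \sum_{k=0}^{\infty} k \frac{e^{-\lambda}\lambda^k}{k!} = \lambda. \]

Next,

\[
\begin{aligned}
E[X^2] &= \sum_{k=0}^{\infty} k^2 \frac{e^{-\lambda}\lambda^k}{k!} = \sum_{k=1}^{\infty} \frac{(k-1+1)\,e^{-\lambda}\lambda^k}{(k-1)!}, \\
&= \lambda^2 \sum_{k=2}^{\infty} \frac{e^{-\lambda}\lambda^{k-2}}{(k-2)!} + \lambda \sum_{k=1}^{\infty} \frac{e^{-\lambda}\lambda^{k-1}}{(k-1)!}, \\
&= \lambda^2 + \lambda.
\end{aligned}
\]

Finally,

\[ \sigma_X^2 = E[X^2] - (E[X])^2 = \lambda^2 + \lambda - \lambda^2 = \lambda. \]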


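Example 3: Let X be a discrete random variable with $P(X = k) = \frac{1}{\zeta(3)\,k^3}$ for $k \in \mathbb{N}$, where $\zeta(\cdot)$ is the
Riemann zeta function. Find $\sigma_X^2$.
Solution: We have,

\[ E[X] = \sum_{k=1}^{\infty} k \frac{1}{\zeta(3)\,k^3} = \frac{1}{\zeta(3)} \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\zeta(2)}{\zeta(3)} = \frac{\pi^2}{6\,\zeta(3)}. \]

Next, we have

\[ E[X^2] = \sum_{k=1}^{\infty} k^2 \frac{1}{\zeta(3)\,k^3} = \frac{1}{\zeta(3)} \sum_{k=1}^{\infty} \frac{1}{k} = \infty. \]

Finally,

\[ \sigma_X^2 = E[X^2] - (E[X])^2 = \infty. \]

The above example is a case of a random variable with finite expected value but infinite variance!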
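Example 4: Let X be a uniform random variable in the interval [a,b]. Find the variance of X.


Solution: Recall that the density of X is given by

\[ f_X(x) = \begin{cases} \frac{1}{b-a} & \text{for } a \le x \le b, \\ 0 & \text{otherwise}. \end{cases} \]

Now, we have

\[ E[X] = \int_a^b \frac{x}{b-a}\,dx = \frac{b^2 - a^2}{2(b-a)} = \frac{a+b}{2}. \]

Next, we have

\[ E[X^2] = \int_a^b \frac{x^2}{b-a}\,dx = \frac{b^3 - a^3}{3(b-a)} = \frac{a^2 + ab + b^2}{3}. \]

Finally,

\[
\begin{aligned}
\sigma_X^2 &= E[X^2] - (E[X])^2, \\
&= \frac{a^2 + ab + b^2}{3} - \frac{a^2 + 2ab + b^2}{4}, \\
&= \frac{b^2 - 2ab + a^2}{12}, \\
&= \frac{(b-a)^2}{12}.
\end{aligned}
\]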


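Example 5: Let X be an exponentially distributed random variable with parameter $\mu$. Find $\sigma_X^2$.
Solution: Recall that for an exponential random variable with parameter $\mu$, $f_X(x) = \mu e^{-\mu x}$ for x ≥ 0.

\[ E[X] = \int_0^{\infty} x f_X(x)\,dx = \frac{1}{\mu}. \]

Next, we have

\[ E[X^2] = \int_0^{\infty} x^2 f_X(x)\,dx = \frac{2}{\mu^2}. \]

Finally,

\[ \sigma_X^2 = E[X^2] - (E[X])^2 = \frac{2}{\mu^2} - \frac{1}{\mu^2} = \frac{1}{\mu^2}. \]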


Example 6: Let $X \sim N(\mu, \sigma^2)$. Find $\sigma_X^2$.
Solution: From Example 2 in Lecture #21, we know that $E[X] = \mu$ and $E[X^2] = \mu^2 + \sigma^2$. Thus, we have

\[
\begin{aligned}
\sigma_X^2 &= E[X^2] - (E[X])^2, \\
&= \mu^2 + \sigma^2 - \mu^2, \\
&= \sigma^2.
\end{aligned}
\]

Note that the normal distribution is parametrized by the expected value $\mu$ and the variance $\sigma^2$.


22.2 Covariance


Having looked at variance, a term that characterizes the extent of deviation of a single random variable 
around its expected value, we now define and study the covariance of two random variables X and Y, a term 
that quantifies the extent of dependence between the two random variables. 


Definition 22.4 Let X and Y be random variables on $(\Omega, \mathcal{F}, P)$. Further, let $E[X] < \infty$ and $E[Y] < \infty$.
The covariance of X and Y is given by

\[ \mathrm{cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]. \]


Definition 22.5 Let X and Y be random variables. X and Y are said to be uncorrelated if cov(X,Y) = 0, 
i.e, if E[XY] = E[X]E[Y]. 


Thus, two random variables are uncorrelated if the expectation of their product is the product of their 
expectations. The following theorem asserts that independent random variables are uncorrelated. 


Theorem 22.6 If X and Y are independent random variables with $E[|X|] < \infty$ and $E[|Y|] < \infty$, then E[XY]
exists, and E[XY] = E[X]E[Y], i.e., cov(X,Y) = 0.


Proof: We will prove this theorem in three steps. We will first assume that the random variables X and Y
are simple and thus can be represented as follows:

\[ X = \sum_{i} x_i I_{A_i} \quad \text{and} \quad Y = \sum_{j} y_j I_{B_j}. \]

Assuming canonical representation of the random variables X and Y, we have

\[ XY = \sum_{i} \sum_{j} x_i y_j I_{A_i \cap B_j}. \]

Thus, we have,

\[ E[XY] = \int XY\,dP = \sum_{i} \sum_{j} (x_i y_j)\,P(A_i \cap B_j). \tag{22.3} \]

Next, as X and Y are independent random variables, $\sigma(X)$ and $\sigma(Y)$ are independent $\sigma$-algebras. Also,
$A_i = \{\omega \in \Omega \mid X(\omega) = x_i\} \in \sigma(X)$ and $B_j = \{\omega \in \Omega \mid Y(\omega) = y_j\} \in \sigma(Y)$. By definition of independent
$\sigma$-algebras,

\[ P(A_i \cap B_j) = P(A_i)P(B_j), \quad \forall i, j. \tag{22.4} \]

Using (22.4) in (22.3), we get

\[
\begin{aligned}
E[XY] &= \sum_{i} \sum_{j} x_i y_j\,P(A_i)P(B_j), \\
&= \left(\sum_{i} x_i P(A_i)\right) \left(\sum_{j} y_j P(B_j)\right), \\
&= E[X]E[Y].
\end{aligned}
\]

We will now extend the proof to non-negative random variables. Let X and Y be non-negative random
variables. Let the sequences of simple random variables, $X_n$ and $Y_n$, be such that $X_n \uparrow X$ and $Y_n \uparrow Y$. We
know that such sequences exist from section 3 in Lecture #19. Also, by construction, it is easy to see that
$X_n$ and $Y_n$ are independent. Consequently, we have $X_n Y_n \uparrow XY$. Thus,

\[ E[XY] \overset{\text{MCT}}{=} \lim_{n\to\infty} E[X_n Y_n] \overset{(a)}{=} \left(\lim_{n\to\infty} E[X_n]\right) \left(\lim_{n\to\infty} E[Y_n]\right) \overset{\text{MCT}}{=} E[X]E[Y], \tag{22.5} \]

where (a) follows from the independence of $X_n$ and $Y_n$ and since both the limits exist.


Finally, for the case of X and Y possibly being negative, let $X = X_+ - X_-$, and let $Y = Y_+ - Y_-$, where
$X_+$, $X_-$, $Y_+$ and $Y_-$ are as defined in Lecture #17. Then

\[
\begin{aligned}
E[XY] &= E[X_+Y_+] + E[X_-Y_-] - E[X_+Y_-] - E[X_-Y_+], && (22.6) \\
&= E[X_+]E[Y_+] + E[X_-]E[Y_-] - E[X_+]E[Y_-] - E[X_-]E[Y_+], && (22.7) \\
&= E[X]E[Y], && (22.8)
\end{aligned}
\]

where (22.6) and (22.8) follow from the linearity of expectations (PAI 4 from Lecture #18) and (22.7) follows
from (22.5). Note that $X_+$ and $X_-$ are functions of X, and $Y_+$ and $Y_-$ are functions of Y. Since X and
Y are independent, all the pairs of random variables inside the expectations on the RHS of (22.6) are independent.¹
Thus, we have proved that independent random variables are uncorrelated. ∎


Caution: While independence guarantees that two random variables are uncorrelated, the converse is not
necessarily true, i.e., two uncorrelated random variables may or may not be independent. We show this by a
counterexample.


Let X ~ unif[−1,1] and Y = X² be two random variables. It can be shown that X and Y are not
independent. However,

\[
\begin{aligned}
\mathrm{cov}(X,Y) &= E[XY] - E[X]E[Y], \\
&= E[X^3] - E[X]E[X^2], \\
&\overset{(a)}{=} 0,
\end{aligned}
\]

where (a) follows since X is symmetric around 0.
Thus, we have an example where two random variables X and Y are uncorrelated but not independent.
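
The same phenomenon can be verified exactly in a discrete analogue of this counterexample. The sketch below swaps the continuous uniform distribution for X uniform on {−1, 0, 1} (an assumption made here so that everything can be computed with exact fractions) and keeps Y = X².

```python
from fractions import Fraction

# Discrete analogue (an assumption for illustration): X uniform on {-1, 0, 1}.
p_X = {x: Fraction(1, 3) for x in (-1, 0, 1)}

# Joint PMF of (X, Y) with Y = X^2; all mass sits on the curve y = x^2.
joint = {(x, x * x): p for x, p in p_X.items()}

e_x = sum(x * p for (x, y), p in joint.items())
e_y = sum(y * p for (x, y), p in joint.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
cov = e_xy - e_x * e_y

# Independence would require P(X = 0, Y = 0) = P(X = 0) P(Y = 0).
p_x0 = p_X[0]
p_y0 = sum(p for (x, y), p in joint.items() if y == 0)
p_joint00 = joint[(0, 0)]

print(cov, p_joint00, p_x0 * p_y0)
```

The covariance is exactly zero, yet the joint probability at (0, 0) is 1/3 while the product of the marginals is 1/9, so X and Y are uncorrelated but not independent.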


Proposition 22.7 Consider two random variables X and Y. Then, we have 


Var(X +Y) = Var(X) + Var(Y) + 2cov(X, Y). 


Proof:

\[
\begin{aligned}
\mathrm{Var}(X+Y) &= E[(X+Y)^2] - (E[X] + E[Y])^2, \\
&= E[X^2 + Y^2 + 2XY] - \left(E[X]^2 + E[Y]^2 + 2E[X]E[Y]\right), \\
&= \left(E[X^2] - E[X]^2\right) + \left(E[Y^2] - E[Y]^2\right) + 2\left(E[XY] - E[X]E[Y]\right), \\
&= \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{cov}(X,Y).
\end{aligned}
\]
∎


¹ Let X and Y be independent random variables on $(\Omega, \mathcal{F}, P)$. Also, let f(·) and g(·) be measurable functions from $\mathbb{R}$ to $\mathbb{R}$.
Then, f(X) and g(Y) are independent random variables.


It is easy to see that if X and Y are uncorrelated, then Var(X + Y) = Var(X) + Var(Y). This can of
course be extended to the sum of any finite number of random variables.


Definition 22.8 Let X and Y be random variables. Then, the correlation coefficient for the two random
variables is defined as:

\[ \rho_{X,Y} \triangleq \frac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}. \]


Theorem 22.9 (Cauchy-Schwarz Inequality) For any two random variables X and Y, $-1 \le \rho_{X,Y} \le 1$.
Further, if $\rho_{X,Y} = 1$, then there exists a > 0 such that Y − E[Y] = a(X − E[X]), and if $\rho_{X,Y} = -1$, then
there exists a < 0 such that Y − E[Y] = a(X − E[X]).


Proof: Let $\tilde{X} = X - E[X]$ and $\tilde{Y} = Y - E[Y]$. Now we know that,

\[ E\left[\left(\tilde{X} - \frac{E[\tilde{X}\tilde{Y}]}{E[\tilde{Y}^2]}\,\tilde{Y}\right)^2\right] \overset{(a)}{\ge} 0, \tag{22.9} \]

\[ \overset{(b)}{\implies} \left(E[\tilde{X}\tilde{Y}]\right)^2 \le E[\tilde{X}^2]\,E[\tilde{Y}^2], \tag{22.10} \]

where (a) follows from PAI 2 of Lecture #18 and (b) follows from the linearity and scaling property of expectation
(PAI 4 and PAI 8 of Lecture #18). From the definition, $E[\tilde{X}^2] = \mathrm{Var}(X)$ and $E[\tilde{Y}^2] = \mathrm{Var}(Y)$. Further, we
can observe that $E[\tilde{X}\tilde{Y}] = \mathrm{cov}(X,Y)$. Thus, it is easy to see that

\[ \rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{E[\tilde{X}\tilde{Y}]}{\sqrt{E[\tilde{X}^2]}\,\sqrt{E[\tilde{Y}^2]}}. \tag{22.11} \]

Combining (22.10) and (22.11), we get

\[ -1 \le \rho_{X,Y} \le 1. \]

Note that $\rho_{X,Y} = 1$ or $\rho_{X,Y} = -1$ when (22.9) is met with equality. This happens when $\tilde{X} = \alpha\tilde{Y}$
almost surely, which proves the second part of the theorem. ∎


The discussion regarding the Cauchy-Schwarz inequality above has a close connection with Hilbert Spaces. As
one may recall from a course in Linear Algebra, a Hilbert Space is a complete vector space endowed with an
inner product.


Let $(\Omega, \mathcal{F}, P)$ be a probability space, and let $\mathcal{L}_2$ be the collection of all zero-mean, real-valued random variables
defined over this space with finite second moment. It can be shown that $\mathcal{L}_2$ with addition of functions and
scalar multiplication (obeyed except perhaps on a set of measure zero) is a Hilbert Space. The associated
inner product is the covariance function. We say that two random variables from $\mathcal{L}_2$ are equivalent if they
agree, except perhaps on a set of measure zero. That is, X ~ Y (read as X is equivalent to Y) if P(X = Y)
= 1, for any $X, Y \in \mathcal{L}_2$. Thus, $\mathcal{L}_2$ is partitioned into several such equivalence classes by the aforementioned
equivalence relation.


In light of this discussion, the covariance function can be interpreted as the dot product of the Hilbert space,
and the correlation coefficient is interpreted as the cosine of the angle between two random variables in this
Hilbert space. In particular, uncorrelated random variables are orthogonal! The interested reader is referred
to sections 7 through 11 of chapter 6 in [1] for a more detailed treatment of this topic; this viewpoint is
especially useful in estimation theory.


22.3 Exercise 


1. [Papoulis] Let a and b be positive integers with a ≤ b, and let X be a random variable that takes as
values, with equal probability, the powers of 2 in the interval $[2^a, 2^b]$. Find the expected value and
variance of X.


2. [Papoulis] Suppose that X and Y are random variables with the same variance. Show that X − Y and
X + Y are uncorrelated.


3. [Papoulis] Suppose that a random variable X satisfies $E[X] = 0$, $E[X^2] = 1$, $E[X^3] = 0$ and $E[X^4] = 3$,
and let $Y = a + bX + cX^2$. Find the correlation coefficient $\rho_{X,Y}$.


4. [Assignment problem, University of Cambridge] Take 0 ≤ r ≤ 1. Let X and Y be independent random
variables taking values ±1 with probabilities $\frac{1}{2}$. Set Z = X with probability r and Z = Y with
probability 1 − r. Find $\rho_{X,Z}$.


5. [Papoulis] Let $X_1, X_2, \ldots, X_n$ be independent random variables with non-zero finite expectations. Show
that

\[ \frac{\mathrm{var}\left(\prod_{i=1}^{n} X_i\right)}{\prod_{i=1}^{n} (E[X_i])^2} = \prod_{i=1}^{n} \left(\frac{\mathrm{var}(X_i)}{(E[X_i])^2} + 1\right) - 1. \]


Lecture 23: Conditional Expectation


Lecturer: Dr. Krishna Jagannathan Scribe: Sudharsan Parthasarathy 


Let X and Y be discrete random variables with joint probability mass function $p_{X,Y}(x,y)$. The
conditional probability mass function was defined in previous lectures as

\[ p_{X|Y}(x|y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}, \]

assuming $p_Y(y) > 0$. Let us define

\[ E[X|Y = y] = \sum_{x} x\,p_{X|Y}(x|y). \]

$\psi(y) = E[X|Y = y]$ changes with y. The random variable $\psi(Y)$ is the conditional expectation of X given Y
and is denoted as E[X|Y].


Let X and Y be continuous random variables with joint probability density function $f_{X,Y}(x,y)$. Recall the
conditional probability density function

\[ f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}, \]

when $f_Y(y) > 0$. Define

\[ \psi(y) = E[X|Y = y] = \int x\,f_{X|Y}(x|y)\,dx. \]

The random variable $\psi(Y)$ is the conditional expectation of X given Y and is denoted as E[X|Y].


Example 1: Find E[Y|X] if the joint probability density function is $f_{X,Y}(x,y) = \frac{1}{x}$; 0 < y ≤ x ≤ 1.


Solution:

\[ f_X(x) = \int_0^x \frac{1}{x}\,dy = 1, \quad 0 < x \le 1, \]

\[ f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)} = \frac{1}{x}, \quad 0 < y \le x, \]

\[ E[Y|X = x] = \int_0^x y\,f_{Y|X}(y|x)\,dy = \int_0^x \frac{y}{x}\,dy = \frac{x}{2}. \]

The conditional expectation is $E[Y|X] = \frac{X}{2}$.


Theorem 23.1 Law of Iterated Expectation:

\[ E[Y] = E_X[E[Y|X]]. \]

Proof: We prove the result for discrete random variables. We have

\[
\begin{aligned}
E_X[E[Y|X]] &= \sum_{x} p_X(x)\,E[Y|X = x] \\
&= \sum_{x} p_X(x) \sum_{y} y\,p_{Y|X}(y|x) \\
&= \sum_{x} p_X(x) \sum_{y} y\,\frac{p_{X,Y}(x,y)}{p_X(x)} \\
&= \sum_{x} \sum_{y} y\,p_{X,Y}(x,y) \\
&= \sum_{y} y \sum_{x} p_{X,Y}(x,y) \\
&= \sum_{y} y\,p_Y(y) \\
&= E[Y].
\end{aligned}
\]
∎

The law of iterated expectation for jointly continuous random variables can be proved similarly.


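Application of the law of iterated expectation:


Let $S_N = \sum_{i=1}^{N} X_i$, where $\{X_1, X_2, \ldots\}$ are independent and identically distributed random variables and N is a
non-negative integer valued random variable independent of $X_i$ for all i. From the law of iterated expectation, $E[S_N] = E_N[E[S_N|N]]$. Consider

\[
\begin{aligned}
E[S_N | N = n] &= E\left[\sum_{i=1}^{N} X_i \,\Big|\, N = n\right] && (23.1) \\
&= E\left[\sum_{i=1}^{n} X_i \,\Big|\, N = n\right]. && (23.2)
\end{aligned}
\]

As N is independent of the $X_i$, $E\left[\sum_{i=1}^{n} X_i \,\big|\, N = n\right] = E\left[\sum_{i=1}^{n} X_i\right] = nE[X]$.


Thus $E[S_N|N] = N E[X]$, and $E[S_N] = E[N]E[X]$.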
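
The identity $E[S_N] = E[N]E[X]$ can be checked by building the PMF of $S_N$ explicitly. In the sketch below the PMFs of $X_i$ and N are hypothetical small examples, and `convolve` and `mean` are helper names introduced only for this illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical PMFs (illustrative values only).
p_X = {1: Fraction(1, 2), 2: Fraction(1, 2)}
p_N = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 2)}


def convolve(p, q):
    """PMF of the sum of two independent integer-valued random variables."""
    out = defaultdict(Fraction)
    for a, pa in p.items():
        for b, qb in q.items():
            out[a + b] += pa * qb
    return dict(out)


def mean(pmf):
    return sum(k * p for k, p in pmf.items())


# PMF of S_N: mix the n-fold convolutions of p_X with weights P(N = n).
p_S = defaultdict(Fraction)
n_fold = {0: Fraction(1)}  # the empty sum equals 0 with probability 1
for n in range(max(p_N) + 1):
    for s, ps in n_fold.items():
        p_S[s] += p_N.get(n, Fraction(0)) * ps
    n_fold = convolve(n_fold, p_X)

print(mean(p_S), mean(p_N) * mean(p_X))
```

Here the mean of $S_N$ is obtained directly from its PMF, without using the law of iterated expectation, and it matches the product of the two means.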


Theorem 23.2 Generalized form of Law of Iterated Expectation:


For any measurable function g with $E[|g(X)|] < \infty$,

\[ E[Y g(X)] = E[E[Y|X]\,g(X)]. \]

Proof: We prove the result for discrete random variables. We have

\[
\begin{aligned}
E[E[Y|X]\,g(X)] &= \sum_{x} p_X(x)\,E[Y|X = x]\,g(x) \\
&= \sum_{x} p_X(x)\,g(x) \sum_{y} y\,p_{Y|X}(y|x) \\
&= \sum_{x} \sum_{y} y\,g(x)\,p_{X,Y}(x,y) \\
&= E[Y g(X)].
\end{aligned}
\]
∎

Exercise: Prove E[Yg(X)] = E[E[Y|X]g(X)] if X and Y are jointly continuous random variables.
This theorem implies that

\[ E[(Y - E[Y|X])\,g(X)] = 0. \tag{23.3} \]

The conditional expectation E[Y|X] can be viewed as an estimator of Y given X. Y − E[Y|X] is then the
estimation error for this estimator. The above theorem implies that the estimation error is uncorrelated
with every function of X.


Observe that in this lecture, we have not dealt with conditional expectations in a general framework. Instead,
we have separately defined it for discrete and jointly continuous random variables. In a more general
development of the topic, (23.3) is in fact taken as the defining property of the conditional expectation.
Specifically, one can prove the existence and uniqueness (up to measure zero) of a $\sigma(X)$-measurable
random variable $\psi(X)$ that satisfies $E[(\psi(X) - Y)g(X)] = 0$ for any g(X). Such a $\psi(X)$ is then defined as
the conditional expectation E[Y|X]. For a more detailed discussion, refer to Chapter 9 in [1].


Minimum Mean Square Error Estimator: 


We have seen that E[Y|X] is an estimator of Y given X. In the next theorem we will prove that this is indeed 
an optimal estimate of Y given X, in the sense that the conditional expectation minimizes the mean-squared 
error. 


Theorem 23.3 If $E[Y^2] < \infty$, then for any measurable function g,

\[ E[(Y - E[Y|X])^2] \le E[(Y - g(X))^2]. \]

Proof:

\[
\begin{aligned}
E[(Y - g(X))^2] &= E[(Y - E[Y|X])^2] + E[(E[Y|X] - g(X))^2] + 2E[(Y - E[Y|X])(E[Y|X] - g(X))] \\
&\ge E[(Y - E[Y|X])^2].
\end{aligned}
\]

This is because $E[(E[Y|X] - g(X))^2] \ge 0$ and $E[(Y - E[Y|X])(E[Y|X] - g(X))] = 0$. The latter holds because
(23.3) gives $E[(E[Y|X] - Y)\psi(X)] = 0$ for any function $\psi(X)$ of X; here $\psi(X) = E[Y|X] - g(X)$. ∎
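
The MMSE property and the orthogonality relation (23.3) can be checked on a small discrete example. The joint PMF and the competing estimators below are hypothetical choices made only for this sketch.

```python
from fractions import Fraction

# Hypothetical joint PMF p[(x, y)] (illustrative values only).
joint = {
    (0, 0): Fraction(1, 4), (0, 2): Fraction(1, 4),
    (1, 1): Fraction(1, 8), (1, 5): Fraction(3, 8),
}


def cond_exp(x):
    """E[Y | X = x] computed from the joint PMF."""
    p_x = sum(p for (a, _), p in joint.items() if a == x)
    return sum(y * p for (a, y), p in joint.items() if a == x) / p_x


def mse(g):
    """Mean-squared error E[(Y - g(X))^2] of the estimator g(X)."""
    return sum((y - g(x)) ** 2 * p for (x, y), p in joint.items())


def error_correlation(g):
    """E[(Y - E[Y|X]) g(X)], which (23.3) says is zero."""
    return sum((y - cond_exp(x)) * g(x) * p for (x, y), p in joint.items())


best = mse(cond_exp)
candidates = [
    lambda x: Fraction(2),                     # constant estimator
    lambda x: 4 * x,                           # linear estimator
    lambda x: cond_exp(x) + Fraction(1, 10),   # perturbed conditional mean
]
print(best, [mse(g) for g in candidates])
print([error_correlation(g) for g in candidates])
```

Every candidate has a mean-squared error at least as large as that of the conditional expectation, and the estimation error has zero correlation with each candidate function of X.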


From (23.3) we observe that the estimation error Y − E[Y|X] is orthogonal to any measurable function
of X. In the Hilbert Space of square integrable random variables, E[Y|X] can be viewed as the projection
of Y onto the subspace $\mathcal{L}_2(\sigma(X))$ of $\sigma(X)$-measurable random variables. As depicted in Figure 23.1, it is
quite intuitive that the conditional expectation (which is the projection of Y onto the subspace) minimizes
the mean-squared error among all random variables from the subspace $\mathcal{L}_2(\sigma(X))$.


Figure 23.1: Geometric interpretation of MMSE


23.1 Exercises 


1. Prove the law of iterated expectation for jointly continuous random variables. 


2. (i) Given is the table for Joint PMF of random variables X and Y. 


| X=0 | X=1 | 
ya m 


| Y=1 0 | 


Let Z = E[X|Y] and V = Var(X|Y). Find the PMF of Z and V, and compute E[Z] and E[V]. 
(ii) Consider a sequence of i.i.d. random variables $\{Z_i\}$ where $P(Z_i = 0) = P(Z_i = 1) = \frac{1}{2}$. Using
this sequence, define a new sequence of random variables $\{X_n\}$ as follows:

$X_0 = 0$,

$X_1 = 2Z_1 - 1$, and

$X_n = X_{n-1} + (1 + Z_1 + \ldots + Z_{n-1})(2Z_n - 1)$ for n ≥ 2.

Show that $E[X_{n+1} | X_0, X_1, \ldots, X_n] = X_n$ a.s. for all n.


3. (a) [MIT OCW problem set] The number of people that enter a pizzeria in a period of 15 minutes
is a (nonnegative integer) random variable K with known moment generating function $M_K(s)$.
Each person who comes in buys a pizza. There are n types of pizzas, and each person is equally
likely to choose any type of pizza, independently of what anyone else chooses. Give a formula, in
terms of $M_K(\cdot)$, for the expected number of different types of pizzas ordered.


(b) John takes a taxi home every day after work. Every evening, he waits by the road to get a taxi,
but every taxi that comes by is occupied with probability 0.8, independently of the others. He
counts the number of taxis he missed till he gets an unoccupied taxi. Once he gets inside the taxi,
he throws a fair six-faced die a number of times equal to the number of taxis he missed. He
adds up the outcomes of the die throws and gives a tip to the driver equal to that sum. Find the expected
amount of tip that John gives every day.


Lecture 24: Probability Generating Functions


Lecturer: Dr. Krishna Jagannathan Scribe: Debayani Ghosh 


24.1 Probability Generating Functions (PGF) 


Definition 24.1 Let X be an integer valued random variable. The probability generating function (PGF)
of X is defined as:

\[ G_X(z) \triangleq E[z^X] = \sum_{i} z^i P(X = i). \]


24.1.1 Convergence 


For a non-negative valued random variable, there exists R, possibly $+\infty$, such that the PGF converges for
|z| < R and diverges for |z| > R, where $z \in \mathbb{C}$. $G_X(z)$ certainly converges for |z| < 1 and possibly in a larger
region as well. Note that,

\[ |G_X(z)| = \left|\sum_{i} z^i P(X = i)\right| \le \sum_{i} |z|^i P(X = i). \]

Since $|z|^i \le 1$ for |z| < 1, the right-hand side is at most $\sum_i P(X = i) = 1$. This implies that $G_X(z)$ converges
absolutely in the region |z| < 1. Generating functions can be defined
for random variables taking negative as well as positive integer values. Such generating functions generally
converge for values of z satisfying $\alpha < |z| < \beta$ for some $\alpha, \beta$ such that $\alpha \le 1 \le \beta$.


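Example 1: Consider the Poisson random variable X with probability mass function

\[ P(X = i) = \frac{e^{-\lambda}\lambda^i}{i!}, \quad i = 0, 1, 2, \ldots \]

Find the PGF of X.


Solution: The PGF of X is

\[ G_X(z) = \sum_{i=0}^{\infty} \frac{z^i e^{-\lambda}\lambda^i}{i!} = e^{\lambda(z-1)}, \quad \forall z \in \mathbb{C}. \]

Example 2: Consider the geometric random variable X with probability mass function

\[ P(X = i) = (1-p)^{i-1} p, \quad i \ge 1. \]

Find the PGF of X.


Solution: The PGF of X is

\[ G_X(z) = \sum_{i=1}^{\infty} z^i (1-p)^{i-1} p = \frac{pz}{1 - z(1-p)}, \quad \text{if } |z| < \frac{1}{1-p}. \]


24.1.2 Properties


1. $G_X(1) = 1$.


2. $\frac{dG_X(z)}{dz}\Big|_{z=1} = E[X]$.


Proof: From the definition,

\[ G_X(z) = E[z^X] = \sum_{i} z^i P(X = i). \]

Now,

\[ \frac{dG_X(z)}{dz} = \frac{d}{dz} \sum_{i} z^i P(X = i) \overset{(a)}{=} \sum_{i} \frac{d}{dz} z^i P(X = i) = \sum_{i} i z^{i-1} P(X = i), \]

where the interchange of differentiation and summation in (a) is a consequence of absolute convergence
of the series $\sum_i z^i P(X = i)$. Thus,

\[ \frac{dG_X(z)}{dz}\Big|_{z=1} = E[X]. \]


3. $\frac{d^k G_X(z)}{dz^k}\Big|_{z=1} = E[X(X-1)(X-2)\cdots(X-k+1)]$.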

4. If X and Y are independent and Z = X + Y, then $G_Z(z) = G_X(z)G_Y(z)$. The ROC for the PGF of Z
is the intersection of the ROCs of the PGFs of X and Y.
Proof:

\[ G_Z(z) = E[z^Z] = E[z^{X+Y}] = E[z^X \cdot z^Y]. \]

Since X and Y are independent, $z^X$ and $z^Y$ are also independent, and hence uncorrelated. This implies that

\[ E[z^X z^Y] = E[z^X]E[z^Y] = G_X(z)G_Y(z). \]

Hence proved.


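5. Random sum of discrete RVs: Let $Y = \sum_{i=1}^{N} X_i$, where the $X_i$’s are i.i.d. discrete positive integer valued
random variables and N is independent of the $X_i$’s. The PGF of Y is $G_Y(z) = G_N(G_X(z))$.
Proof:

\[ G_Y(z) = E[z^Y] = E\left[E\left[z^Y | N\right]\right] \quad \text{(by the law of iterated expectation)}. \]

Now,

\[ E[z^Y | N = n] = E\left[z^{X_1 + \cdots + X_n} \,\big|\, N = n\right] = (G_X(z))^n. \]

This implies that

\[ G_Y(z) = E\left[(G_X(z))^N\right] = G_N(G_X(z)). \]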
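
Properties 4 and 5 can be verified exactly for finitely supported PMFs, where a PGF is just a polynomial. In the sketch below all PMFs and the evaluation point z are hypothetical values chosen for illustration, and `pgf` and `convolve` are helper names introduced here.

```python
from collections import defaultdict
from fractions import Fraction


def pgf(pmf, z):
    """Evaluate G(z) = sum_k z^k P(k) for a PMF on non-negative integers."""
    return sum(p * z ** k for k, p in pmf.items())


def convolve(p, q):
    """PMF of the sum of two independent integer-valued random variables."""
    out = defaultdict(Fraction)
    for a, pa in p.items():
        for b, qb in q.items():
            out[a + b] += pa * qb
    return dict(out)


# Hypothetical PMFs (illustrative values only).
p_X = {1: Fraction(1, 2), 2: Fraction(1, 2)}
p_W = {0: Fraction(1, 3), 2: Fraction(2, 3)}
p_N = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 2)}
z = Fraction(3, 7)

# Property 4: PGF of an independent sum is the product of the PGFs.
prop4_lhs = pgf(convolve(p_X, p_W), z)
prop4_rhs = pgf(p_X, z) * pgf(p_W, z)

# Property 5: PMF of the random sum Y = X_1 + ... + X_N, built by mixing
# n-fold convolutions of p_X with weights P(N = n).
p_Y = defaultdict(Fraction)
n_fold = {0: Fraction(1)}
for n in range(max(p_N) + 1):
    for s, ps in n_fold.items():
        p_Y[s] += p_N.get(n, Fraction(0)) * ps
    n_fold = convolve(n_fold, p_X)
prop5_lhs = pgf(p_Y, z)
prop5_rhs = pgf(p_N, pgf(p_X, z))

print(prop4_lhs == prop4_rhs, prop5_lhs == prop5_rhs)
```

Evaluating at a rational point keeps the comparison exact; checking a single point is a sanity check, not a proof of the polynomial identity.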
24.2 Exercise
1. Find the PMF of a random variable X whose probability generating function is given by
$G_X(z) = \left(\frac{1}{3}z + \frac{2}{3}\right)^4$.


2. Suppose there are $X_0$ individuals in the initial generation of a population. In the $n^{th}$ generation, the $X_n$
individuals independently give rise to numbers of offspring $Y_1^{(n)}, Y_2^{(n)}, \ldots, Y_{X_n}^{(n)}$, where $Y_1^{(n)}, Y_2^{(n)}, \ldots, Y_{X_n}^{(n)}$
are i.i.d. random variables. The total number of individuals produced at the $(n+1)^{st}$ generation will
then be $X_{n+1} = Y_1^{(n)} + Y_2^{(n)} + \ldots + Y_{X_n}^{(n)}$. Then, $\{X_n\}$ is called a branching process. Let $X_n$ be the size
of the $n^{th}$ generation of a branching process with family-size probability generating function G(z), and
let $X_0 = 1$. Show that the probability generating function $G_n(z)$ of $X_n$ satisfies $G_{n+1}(z) = G(G_n(z))$
for n ≥ 0. Also, prove that $E[X_n] = E[X_{n-1}]\,G'(1)$.


EE5110: Probability Foundations for Electrical Engineers July-November 2015 


Lecture 25: Moment Generating Function 


Lecturer: Dr. Krishna Jagannathan Scribe: Subrahmanya Swamy P 


In this lecture, we will introduce Moment Generating Function and discuss its properties. 


Definition 25.1 The moment generating function (MGF) associated with a random variable X is a function $M_X : \mathbb{R} \to [0, \infty]$ defined by $M_X(s) = E[e^{sX}]$.


The domain or region of convergence (ROC) of $M_X$ is the set $D_X = \{s \mid M_X(s) < \infty\}$. In general, s can be complex, but since we did not define expectation of complex valued random variables, we will restrict ourselves to real valued s. Note that s = 0 is always a point in the ROC for any random variable, since $M_X(0) = 1$.
Cases:

• If X is discrete with pmf $p_X(x)$, then $M_X(s) = \sum_{x} e^{sx} p_X(x)$.

• If X is continuous with density $f_X(\cdot)$, then $M_X(s) = \int e^{sx} f_X(x) \, dx$.
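Example 25.2 Exponential random variable

$$f_X(x) = \mu e^{-\mu x}, \quad x \ge 0,$$

$$M_X(s) = \int_{0}^{\infty} e^{sx} \mu e^{-\mu x} \, dx = \begin{cases} \frac{\mu}{\mu - s}, & \text{if } s < \mu, \\ +\infty, & \text{otherwise.} \end{cases}$$

The Region of Convergence for this example is $\{s \mid M_X(s) < \infty\}$, i.e., $s < \mu$.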
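The closed form $\mu/(\mu - s)$ and its region of convergence can be illustrated numerically. The sketch below is an illustration only: the helper `mgf_numeric`, the truncation point `upper` and the value `mu = 2.0` are assumptions made here. It approximates the defining integral by the trapezoidal rule on a truncated range. For $s < \mu$ the approximation agrees with the closed form; for $s \ge \mu$ the truncated integral keeps growing as the range is extended.

```python
import math

def mgf_numeric(s, mu, upper=50.0, steps=100_000):
    """Trapezoidal approximation of int_0^upper e^{s x} mu e^{-mu x} dx."""
    h = upper / steps
    f = lambda x: mu * math.exp((s - mu) * x)
    total = 0.5 * (f(0.0) + f(upper))
    total += sum(f(k * h) for k in range(1, steps))
    return total * h

mu = 2.0  # illustrative rate
for s in (-1.0, 0.0, 1.0):  # points inside the ROC, s < mu
    print(s, mgf_numeric(s, mu), mu / (mu - s))

# Outside the ROC (s >= mu) the truncated integral grows without bound.
print(mgf_numeric(2.5, mu, upper=20.0), mgf_numeric(2.5, mu, upper=40.0))
```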


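Example 25.3 Std. Normal random variable

$$f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}, \quad x \in \mathbb{R},$$

$$M_X(s) = \int_{-\infty}^{\infty} e^{sx} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} \, dx = e^{\frac{s^2}{2}}, \quad s \in \mathbb{R}.$$

The Region of Convergence for this example is the entire real line.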


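Example 25.4 Cauchy random variable

$$f_X(x) = \frac{1}{\pi (1 + x^2)}, \quad x \in \mathbb{R},$$

$$M_X(s) = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{e^{sx}}{1 + x^2} \, dx = \begin{cases} 1, & \text{if } s = 0, \\ +\infty, & \text{otherwise.} \end{cases}$$

The Region of Convergence for this example is just the point s = 0.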
Remark 2: The above examples can be interpreted as follows.


• In Example 25.2, we have the product of two exponentials. Thus, the MGF converges when the product is decreasing.


• In Example 25.3, there is a 'competition' between $e^{-\frac{x^2}{2}}$ and $e^{sx}$. Since the first term from the Gaussian decreases faster than $e^{sx}$ increases (for any s), the integral always converges.


• In Example 25.4, for $s \ne 0$, an exponential competes with a decreasing polynomial, as a result of which the integral diverges.


It is an interesting question whether or not we can uniquely find the CDF of a random variable, given the 
moment generating function and its ROC. A quick look at Example 25.4 reveals that if the MGF is finite only 
at s = 0 and infinite elsewhere, it is not possible to recover the CDF uniquely. To see this, one just needs to 
produce another random variable whose MGF is finite only at s = 0. (Do this!) On the other hand, if we can 
specify the value of the moment generating function even in a tiny interval, we can uniquely determine the 
density function. This result follows essentially because the MGF, when it exists in an interval, is analytic, 
and hence possesses some nice properties. The proof of the following theorem is rather involved, and uses 
the properties of an analytic function. 


Theorem 25.5 (Without Proof) 


i) Suppose $M_X(s)$ is finite in the interval $[-\epsilon, \epsilon]$ for some $\epsilon > 0$, then $M_X$ uniquely determines the CDF of X.


ii) If X and Y are two random variables such that $M_X(s) = M_Y(s)$ $\forall s \in [-\epsilon, \epsilon]$, $\epsilon > 0$, then X and Y have the same CDF.


25.1 Properties 


1. Mx(0) = 1. 
2. Moment Generating Property: We shall state this property in the form of a theorem. 


Theorem 25.6 Suppose $M_X(s) < \infty$ for $s \in [-\epsilon, \epsilon]$, $\epsilon > 0$. Then,

$$\frac{d}{ds} M_X(s)\Big|_{s=0} = E[X]. \qquad (25.1)$$

More generally,

$$\frac{d^{m}}{ds^{m}} M_X(s)\Big|_{s=0} = E[X^{m}], \quad m \ge 1.$$


Proof: (25.1) can be proved in the following steps.

$$\frac{d}{ds} M_X(s) = \frac{d}{ds} E[e^{sX}] \overset{(a)}{=} E\left[\frac{d}{ds} e^{sX}\right] = E[X e^{sX}],$$

where (a) is obtained by the interchange of the derivative and the expectation. This follows from the use of the basic definition of the derivative, and then invoking the DCT; see Lemma 25.7 (d). Setting s = 0 gives (25.1). ∎
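Lemma 25.7 Suppose that X is a non-negative random variable and $M_X(s) < \infty$, $\forall s \in (-\infty, a]$, where a is a positive number. Then

(a) $E[X^{k}] < \infty$, for every k.

(b) $E[X^{k} e^{sX}] < \infty$, for every $s < a$.

(c) $\frac{e^{hX} - 1}{h} \le X e^{hX}$, for $h > 0$.

(d) $E[X] = E\left[\lim_{h \downarrow 0} \frac{e^{hX} - 1}{h}\right] = \lim_{h \downarrow 0} \frac{E[e^{hX}] - 1}{h}$.


Proof: Given that X is a non-negative random variable with a moment generating function such that $M_X(s) < \infty$, $\forall s \in (-\infty, a]$, for some positive a.

(a) For a positive number a and $x \ge 0$, the series expansion of $e^{ax}$ gives $e^{ax} \ge \frac{(ax)^{k}}{k!}$, i.e., $x^{k} \le \frac{k!}{a^{k}} e^{ax}$, $\forall k \in \mathbb{Z}^{+} \cup \{0\}$. Therefore,

$$E[X^{k}] = \int x^{k} \, dP_X \le \frac{k!}{a^{k}} \int e^{ax} \, dP_X.$$

However, $\int e^{ax} \, dP_X = M_X(a) < \infty$. Therefore, $E[X^{k}] < \infty$.

(b) For $s < a$, $\exists \epsilon > 0$ such that $M_X(s + \epsilon) < \infty \Rightarrow \int e^{(s+\epsilon)x} \, dP_X < \infty$. Since $\epsilon > 0$, we have $x^{k} \le \frac{k!}{\epsilon^{k}} e^{\epsilon x}$ for $x \ge 0$. Therefore,

$$E[X^{k} e^{sX}] = \int x^{k} e^{sx} \, dP_X \le \frac{k!}{\epsilon^{k}} \int e^{\epsilon x} e^{sx} \, dP_X < \infty \Rightarrow E[X^{k} e^{sX}] < \infty.$$

(c) To prove that $\frac{e^{hX} - 1}{h} \le X e^{hX}$, let $hX = Y$. Re-arranging the terms, we need to prove that $e^{Y} - Y e^{Y} \le 1$. Equivalently, it is enough to prove that $g(Y) = e^{Y}(Y - 1) \ge -1$. Now $g(Y)$ has a minimum at $Y = 0$, and the minimum value is $g(0) = -1$.
$\Rightarrow e^{Y}(Y - 1) \ge -1$.
Hence proved.

(d) Define $X_h = \frac{e^{hX} - 1}{h}$. Then $\lim_{h \downarrow 0} X_h = X$, i.e., $X_h \to X$ point-wise. Fix some $h_0$ with $0 < h_0 < a$. Since $E[X^{k} e^{sX}] < \infty$ holds with $s = h_0$ and $k = 1$, we get $E[X e^{h_0 X}] < \infty$. By (c) and the non-negativity of X, for $0 < h \le h_0$ we have $0 \le X_h \le X e^{hX} \le X e^{h_0 X}$. Since $X_h$ is dominated by $X e^{h_0 X}$, $E[X e^{h_0 X}] < \infty$ and $\lim_{h \downarrow 0} X_h = X$, applying the DCT we get

$$E[X] = E\left[\lim_{h \downarrow 0} X_h\right] = E\left[\lim_{h \downarrow 0} \frac{e^{hX} - 1}{h}\right] = \lim_{h \downarrow 0} E\left[\frac{e^{hX} - 1}{h}\right] = \lim_{h \downarrow 0} \frac{E[e^{hX}] - 1}{h}.$$

Hence proved.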


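3. If $Y = aX + b$, $a, b \in \mathbb{R}$, then $M_Y(s) = e^{sb} M_X(as)$. For example, $X \sim N(0, 1)$, $Y = \sigma X + \mu$
$\Rightarrow Y \sim N(\mu, \sigma^2) \Rightarrow M_Y(s) = e^{\mu s + \frac{\sigma^2 s^2}{2}}$, $s \in \mathbb{R}$.


4. If X and Y are independent and Z = X + Y, then $M_Z(s) = M_X(s) M_Y(s)$.
Proof: $E[e^{sZ}] = E[e^{sX + sY}] = E[e^{sX} e^{sY}] = E[e^{sX}] E[e^{sY}]$. ∎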


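Consider the following examples:


(a) $X_1 \sim N(\mu_1, \sigma_1^2)$; $X_2 \sim N(\mu_2, \sigma_2^2)$; and $X_1$, $X_2$ are independent. $Z = X_1 + X_2$;

$$M_{X_1}(s) = e^{\mu_1 s + \frac{\sigma_1^2 s^2}{2}},$$
$$M_{X_2}(s) = e^{\mu_2 s + \frac{\sigma_2^2 s^2}{2}},$$
$$M_Z(s) = M_{X_1}(s) M_{X_2}(s) = e^{(\mu_1 + \mu_2) s + \frac{(\sigma_1^2 + \sigma_2^2) s^2}{2}},$$

$\Rightarrow Z \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.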
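(b) $X_1 \sim \exp(\mu)$; $X_2 \sim \exp(\lambda)$, $\lambda \ne \mu$, and $X_1$, $X_2$ are independent. $Z = X_1 + X_2$;

$$M_{X_1}(s) = \frac{\mu}{\mu - s},$$
$$M_{X_2}(s) = \frac{\lambda}{\lambda - s},$$
$$M_Z(s) = M_{X_1}(s) M_{X_2}(s) = \frac{\mu \lambda}{(\mu - s)(\lambda - s)}, \quad \text{ROC is } s < \min(\lambda, \mu),$$

$$\Rightarrow f_Z(z) = \frac{\mu \lambda}{\lambda - \mu} \left(e^{-\mu z} - e^{-\lambda z}\right), \quad z \ge 0.$$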


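5. $Z = \sum_{i=1}^{N} X_i$, where the $X_i$ are i.i.d. and N is independent of the $X_i$.

$$M_Z(s) = E[e^{sZ}] = E\left[E[e^{sZ} \mid N]\right] = E\left[(M_X(s))^{N}\right].$$

If we write in terms of the PGF and MGF of N, then,

$$M_Z(s) = G_N(M_X(s)) = M_N(\log M_X(s)).$$

For example, $X_i \sim \exp(\mu)$; $N \sim \text{Geom}(p)$ and $Z = \sum_{i=1}^{N} X_i$. Then the distribution of Z is computed as follows:

$$M_X(s) = \frac{\mu}{\mu - s}, \quad s < \mu,$$
$$G_N(z) = \frac{pz}{1 - (1-p)z}, \quad |z| < \frac{1}{1-p},$$
$$M_Z(s) = G_N(M_X(s)) = \frac{p \left(\frac{\mu}{\mu - s}\right)}{1 - (1-p)\left(\frac{\mu}{\mu - s}\right)} = \frac{\mu p}{\mu p - s}, \quad s < \mu p,$$

$\Rightarrow Z \sim \exp(\mu p)$.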
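The conclusion $Z \sim \exp(\mu p)$ can be sanity-checked by simulation. The following is a rough sketch under assumed parameter values ($\mu = 2$, $p = 0.25$, both chosen only for illustration); the helper names are hypothetical. It also recovers $E[Z]$ from the MGF through a central finite difference at $s = 0$, illustrating the moment generating property.

```python
import math
import random

def sample_geometric(p, rng):
    """Number of Bernoulli(p) trials up to and including the first success."""
    n = 1
    while rng.random() >= p:
        n += 1
    return n

def sample_random_sum(mu, p, rng):
    """Z = X_1 + ... + X_N with X_i ~ exp(mu) i.i.d. and N ~ Geom(p)."""
    n = sample_geometric(p, rng)
    return sum(rng.expovariate(mu) for _ in range(n))

def mgf_z(s, mu, p):
    """Closed form mu p / (mu p - s), valid for s < mu p."""
    return mu * p / (mu * p - s)

rng = random.Random(0)   # fixed seed so the run is reproducible
mu, p = 2.0, 0.25        # illustrative values
samples = [sample_random_sum(mu, p, rng) for _ in range(50_000)]
sample_mean = sum(samples) / len(samples)

# Moment generating property: E[Z] is the derivative of M_Z at s = 0.
h = 1e-4
mean_from_mgf = (mgf_z(h, mu, p) - mgf_z(-h, mu, p)) / (2 * h)

# Empirical tail versus the exp(mu p) tail e^{-mu p t} at t = 2.
tail_empirical = sum(z > 2.0 for z in samples) / len(samples)
print(sample_mean, mean_from_mgf, 1.0 / (mu * p))
print(tail_empirical, math.exp(-mu * p * 2.0))
```

Agreement of the sample mean and tail with the exponential values is only a consistency check, not a proof of the distributional claim.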


25.2 Exercise 


1. (a) [Dimitri P.Bertsekas] Find the MGF associated with an integer-valued random variable X that 
is uniformly distributed in the range {a,a + 1, ..., b}. 
(b) [Dimitri P. Bertsekas] Find the MGF associated with a continuous random variable X that is uniformly distributed in the range [a, b].


2. [Dimitri P. Bertsekas] A non-negative integer-valued random variable X has one of the following MGFs:

(i) $M(s) = e^{2(e^{e^{s} - 1} - 1)}$,

(ii) $M(s) = e^{2(e^{e^{s}} - 1)}$.

(a) Explain why one of the two cannot possibly be an MGF.
(b) Use the true MGF to find P(X = 0).


3. Find the variance of a random variable X whose moment generating function is given by 


Mx (s) = ese" —3 
Lecture 26: Characteristic Functions


Lecturer: Dr. Krishna Jagannathan Scribe: Aseem Sharma and Ajay M.


The characteristic function of a random variable X is defined as

$$C_X(t) = E[e^{itX}] = E[\cos(tX)] + iE[\sin(tX)],$$

which can also be written as

$$C_X(t) = \int e^{itx} \, dP_X.$$

If X is a continuous random variable with density function $f_X(x)$, then

$$C_X(t) = \int e^{itx} f_X(x) \, dx.$$
The advantage with the characteristic function is that it always exists, unlike the moment generating function, 
which can be infinite everywhere except s = 0. 
Example 1: Let X be an exponential random variable with parameter $\mu$. Find its characteristic function.


Solution: Recall that for an exponential random variable with parameter $\mu$, $f_X(x) = \mu e^{-\mu x}$, $x \ge 0$. Thus, we have

$$C_X(t) = \int_{0}^{\infty} \mu e^{-\mu x} e^{itx} \, dx = \frac{\mu}{\mu - it}.$$

We have evaluated the above integral essentially by pretending that $\mu - it$ is a real number. Although this happens to produce the correct answer in this case, the correct method of evaluating a characteristic function is by performing contour integration. Indeed, in the next example, it is not possible to obtain the correct answer by pretending that $it$ is a real number (which it is not).
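As a numerical cross-check that does not rely on treating $\mu - it$ as real, the integral can be approximated directly with complex arithmetic. This is a minimal sketch: the trapezoidal rule, the truncation point `upper`, the function names and the value `mu = 1.5` are assumptions chosen for illustration, not the contour-integration argument mentioned above.

```python
import cmath

def cf_numeric(t, mu, upper=40.0, steps=100_000):
    """Trapezoidal approximation of int_0^upper e^{i t x} mu e^{-mu x} dx."""
    h = upper / steps
    f = lambda x: mu * cmath.exp((1j * t - mu) * x)
    total = 0.5 * (f(0.0) + f(upper))
    total += sum(f(k * h) for k in range(1, steps))
    return total * h

def cf_closed_form(t, mu):
    """Closed form mu / (mu - i t)."""
    return mu / (mu - 1j * t)

mu = 1.5  # illustrative parameter
for t in (-3.0, 0.0, 0.5, 3.0):
    print(t, cf_numeric(t, mu), cf_closed_form(t, mu))
```

The printed values also show that the modulus of the characteristic function never exceeds 1, with equality at t = 0.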


Example 2: Let X be a Cauchy random variable. Find its characteristic function. 


Solution: The density function for a Cauchy random variable is

$$f_X(x) = \frac{1}{\pi (1 + x^2)}, \quad x \in \mathbb{R}.$$

Therefore,

$$C_X(t) = \int_{-\infty}^{\infty} \frac{e^{itx}}{\pi (1 + x^2)} \, dx = e^{-|t|}.$$


The above expression is not entirely trivial to obtain. Indeed, it requires considering two separate contour integrals for $t > 0$ and $t < 0$, and invoking Cauchy's residue theorem to evaluate the contour integrals. (For details, see http://www.wpressutexas.net/forum/attachment.php?attachmentid=408&d=1296667390.) However, it is also possible to obtain the characteristic function of the Cauchy random variable by invoking a Fourier transform duality trick from your undergraduate signals and systems course. (Do it!)


Recall also that the moment generating function of a Cauchy random variable does not converge anywhere except at s = 0. On the other hand, we find here that the characteristic function for the Cauchy random variable exists everywhere. This is essentially because the integral defining the characteristic function converges absolutely, and hence uniformly, for all $t \in \mathbb{R}$. Characteristic functions are thus particularly useful in handling heavy-tailed random variables, for which the corresponding moment generating functions do not exist.


Let us next discuss some properties of characteristic functions. 


26.1 Properties of characteristic functions 


26.1.1 Elementary properties 
1) If $Y = aX + b$, then $C_Y(t) = e^{ibt} C_X(at)$.
2) If X and Y are independent random variables and Z = X + Y, then $C_Z(t) = C_X(t) C_Y(t)$.


3) If $M_X(s) < \infty$ for $s \in [-\epsilon, \epsilon]$, then $C_X(t) = M_X(it)$ for all $t \in \mathbb{R}$.
Example 3: Let $X \sim N(0, 1)$. The moment generating function is

$$M_X(s) = e^{\frac{s^2}{2}}.$$

Then, the characteristic function is

$$C_X(t) = M_X(it) = e^{-\frac{t^2}{2}}.$$

For a non-standard Gaussian, $Y \sim N(\mu, \sigma^2)$, we can now invoke property 1) with $Y = \sigma X + \mu$ and conclude that $C_Y(t) = \exp\left(i\mu t - \frac{\sigma^2 t^2}{2}\right)$.
26.1.2 Defining properties 


Theorem 26.1 A characteristic function Cx (t) satisfies the following properties: 


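1) $C_X(0) = 1$ and $|C_X(t)| \le 1$, $\forall t \in \mathbb{R}$.


2) $C_X(t)$ is uniformly continuous on $\mathbb{R}$, i.e., $\forall t \in \mathbb{R}$, $\exists$ a $\psi(h) \downarrow 0$ as $h \downarrow 0$ such that

$$|C_X(t + h) - C_X(t)| \le \psi(h).$$


3) $C_X(t)$ is a non-negative definite kernel, i.e., for any n, any real $t_1, t_2, \ldots, t_n$, and any complex $z_1, z_2, \ldots, z_n$, we have

$$\sum_{j,k} z_j C_X(t_j - t_k) \overline{z_k} \ge 0.$$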


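Proof:

1) $C_X(0) = E[e^{0}] = 1$, and

$$|C_X(t)| = \left|\int e^{itx} \, dP_X\right| \le \int |e^{itx}| \, dP_X = 1.$$

2)
$$|C_X(t + h) - C_X(t)| = \left|E[e^{i(t+h)X}] - E[e^{itX}]\right| = \left|E[e^{itX}(e^{ihX} - 1)]\right| \le E[|e^{ihX} - 1|].$$

Let $|e^{ihX} - 1| = y(h)$ and $E[y(h)] = \psi(h)$. We now need to show that $\psi(h) \downarrow 0$ as $h \downarrow 0$. Note that $y(h) \to 0$ as $h \to 0$. Further,

$$y(h) = |e^{ihX} - 1| = \sqrt{(\cos(hX) - 1)^2 + (\sin(hX))^2} = \sqrt{2 - 2\cos(hX)} = 2\left|\sin\left(\frac{hX}{2}\right)\right| \le 2.$$

Since $y(h)$ is bounded above by 2, applying the DCT, we thus have $\psi(h) \to 0$ as $h \to 0$.

3)
$$\sum_{j,k} z_j C_X(t_j - t_k) \overline{z_k} = E\left[\sum_{j,k} z_j e^{i(t_j - t_k)X} \overline{z_k}\right] = E\left[\left|\sum_{j} z_j e^{i t_j X}\right|^2\right] \ge 0.$$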


The significance of 3) may not be apparent at a first glance. However, these three properties are considered 
as the defining properties of a characteristic function, because these properties are also sufficient for an 
arbitrary function to be the characteristic function of some random variable. This important result is known 
as Bochner’s theorem, which is beyond our scope. 


Theorem 26.2 (Bochner’s theorem) A function C(-) is a characteristic function of a random variable 
if and only if it satisfies the properties of theorem 26.1. 
26.2 Inversion Theorems


The following inverse theorems are presented without proof, since the proofs require some sophisticated 
machinery from harmonic analysis and complex variables. Essentially, they state that the CDF of a random 
variable can be recovered from the characteristic function. 


Theorem 26.3 


(i) Let X be a continuous random variable, having a probability density function $f_X(x)$, and let the corresponding characteristic function be

$$C_X(t) = \int_{-\infty}^{\infty} e^{itx} f_X(x) \, dx. \qquad (26.1)$$

The probability density function $f_X(x)$ can be obtained from the characteristic function as

$$f_X(x) = \frac{1}{2\pi} \lim_{T \to \infty} \int_{-T}^{T} e^{-itx} C_X(t) \, dt, \qquad (26.2)$$

at every point where $f_X(x)$ is differentiable.


(ii) A sufficient (but not necessary) condition for the existence of a probability density function is that the characteristic function should be absolutely integrable, i.e.,

$$\int_{-\infty}^{\infty} |C_X(t)| \, dt < \infty. \qquad (26.3)$$


(iii) Let $C_X(t)$ be a valid characteristic function of a random variable X with a cumulative distribution function $F_X(x)$. We define

$$\hat{F}_X(x) = \frac{1}{2}\left(F_X(x) + \lim_{y \uparrow x} F_X(y)\right), \qquad (26.4)$$

then

$$\hat{F}_X(b) - \hat{F}_X(a) = \lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-iat} - e^{-ibt}}{it} C_X(t) \, dt, \quad \forall a \text{ and } b. \qquad (26.5)$$


In part (iii) above, the function $\hat{F}_X(x)$ coincides with the CDF $F_X(x)$ at all points where the CDF is continuous. At points of discontinuity, it is easy to see that $\hat{F}_X(x)$ takes the value at the mid-point of the right and left limits of the CDF. Equation (26.5) says that the function $\hat{F}_X(x)$ can be recovered from the characteristic function. Finally, since the CDF is right-continuous, we can recover $F_X(x)$ from $\hat{F}_X(x)$.


26.3 Moments from the Characteristic Function 


Theorem 26.4 


(i) Let X be a random variable having a characteristic function $C_X(t)$. If $\frac{d^{k} C_X(t)}{dt^{k}}$ exists at t = 0, then

(a) $E[|X^{k}|] < \infty$ when k is even.
(b) $E[|X^{k-1}|] < \infty$ when k is odd.


(ii) If $E[|X^{k}|] < \infty$, then

$$\frac{d^{k} C_X(t)}{dt^{k}}\Big|_{t=0} = i^{k} E[X^{k}]. \qquad (26.6)$$

Further,

$$C_X(t) = \sum_{j=0}^{k} \frac{E[X^{j}]}{j!} (it)^{j} + o(t^{k}), \qquad (26.7)$$

where the error $o(t^{k})$ means that $o(t^{k}) / t^{k} \to 0$ as $t \to 0$.


Note: Since $C_X(t) = \int e^{itx} \, dP_X$ converges uniformly, we are justified in 'taking the derivative inside the integral.'


26.4 Exercise: 


1. [Papoulis] Use the characteristic function definition to find the distribution of $Y = aX^2$, if X is Gaussian with zero mean and variance $\sigma^2$.


2. [Papoulis] Use the characteristic function definition to find the distribution of $Y = \sin(X)$, if X is uniformly distributed in $(-\pi/2, \pi/2)$.


Lecture 27: Concentration Inequalities
Lecturer: Dr. Krishna Jagannathan Scribe: Arjun Nadh 


A concentration inequality is a result that gives us a probability bound on certain random variables taking 
atypically large or atypically small values. While concentration of probability measures is a vast topic, we 
will only discuss some foundational concentration inequalities in this lecture. 


27.1 Markov’s Inequality 


If X is a non-negative random variable, with $E[X] < \infty$, then for any $a > 0$,

$$P(X > a) \le \frac{E[X]}{a}.$$

Clearly, this inequality is meaningful only when $a > E[X]$.
Proof:

$$E[X] \overset{(a)}{=} E[X I_{\{X \le a\}}] + E[X I_{\{X > a\}}] \overset{(b)}{\ge} E[X I_{\{X > a\}}] \ge a P(X > a),$$

where (a) follows from linearity of expectations. Since X is a non-negative random variable, $E[X I_{\{X \le a\}}] \ge 0$ and thus (b) follows. ∎


Markov Inequality is probably the most fundamental concentration inequality, although it is usually quite 
loose. After all, the bound decays rather slowly, as 1/a. Tighter bounds can be derived under stronger as- 
sumptions on the random variable. For example, when the variance is finite, we have Chebyshev’s inequality. 


27.2 Chebyshev Inequality 


If X is a random variable with expectation $\mu$ and variance $\sigma^2 < \infty$, then

$$P(|X - \mu| > k\sigma) \le \frac{1}{k^2}, \quad k > 0.$$

This can also be written as

$$P(|X - \mu| > c) \le \frac{\sigma^2}{c^2}, \quad c > 0.$$

Proof: The proof follows by applying Markov's inequality to the non-negative random variable $|X - \mu|^2$.

$$P(|X - \mu|^2 > (k\sigma)^2) \le \frac{E[|X - \mu|^2]}{(k\sigma)^2} = \frac{1}{k^2}$$
$$\Rightarrow P(|X - \mu| > k\sigma) \le \frac{1}{k^2}.$$


Note that the Chebyshev’s bound decays as 1/k?, an improvement over the basic Markov inequality. As one 
might imagine, exponentially decaying bounds can be derived by invoking the Markov inequality, as long as 
the moment generating function exists in a neighbourhood of the origin. This result is known as the Chernoff 
bound, which we present briefly. 


27.3 Chernoff Bound 


Let $M_X(s) = E[e^{sX}]$ and assume that $M_X(s) < \infty$ for $s \in [-\epsilon, \epsilon]$ for some $\epsilon > 0$. Then

$$P(X > a) \le e^{-\Lambda^{*}(a)},$$

where $\Lambda^{*}(a) = \sup_{s > 0} (sa - \log M_X(s))$.


Proof: For any $s > 0$,

$$P(X > a) = P(e^{sX} > e^{sa}).$$

By Markov's inequality,

$$P(X > a) \le \frac{E[e^{sX}]}{e^{sa}},$$
$$P(X > a) \le M_X(s) e^{-sa}, \quad \forall s > 0 \text{ and } s \in D_X, \qquad (27.1)$$

where $D_X = \{s \mid M_X(s) < \infty\}$.


In (27.1), note that the bound decays exponentially in a for every $s > 0$ belonging to $D_X$. The tightest such exponential bound is obtained by infimising the right hand side:

$$P(X > a) \le \inf_{s > 0} M_X(s) e^{-sa} = e^{-\sup_{s > 0} (sa - \log M_X(s))}.$$

Thus

$$P(X > a) \le e^{-\Lambda^{*}(a)}.$$


This gives us an exponentially decaying bound for the 'positive tail' $P(X > a)$. Similarly we can prove a Chernoff bound for the negative tail $P(X < a)$ by taking $s < 0$.
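To compare the three bounds of this lecture on one concrete case, the sketch below evaluates them for a Binomial(n, p) random variable. The choice $n = 100$, $p = 0.5$, $a = 60$ and all function names are illustrative assumptions, not taken from the notes. The MGF used is $M_X(s) = (1 - p + pe^{s})^{n}$, the Chernoff infimum is approximated by a simple grid search over s, and the exact tail is computed for reference. The bounds are applied to the event $\{X \ge a\}$, for which the same inequalities hold.

```python
import math

def binomial_tail(n, p, a):
    """Exact P(X >= a) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, n + 1))

def markov_bound(n, p, a):
    """E[X] / a with E[X] = n p."""
    return n * p / a

def chebyshev_bound(n, p, a):
    """Var(X) / (a - n p)^2, using {X >= a} within {|X - n p| >= a - n p}, a > n p."""
    return n * p * (1 - p) / (a - n * p) ** 2

def chernoff_bound(n, p, a, s_max=5.0, grid=5000):
    """Grid approximation of inf_{s>0} M_X(s) e^{-s a}, M_X(s) = (1 - p + p e^s)^n."""
    best = 1.0
    for k in range(1, grid + 1):
        s = s_max * k / grid
        log_bound = n * math.log(1 - p + p * math.exp(s)) - s * a
        best = min(best, math.exp(log_bound))
    return best

n, p, a = 100, 0.5, 60  # illustrative values
print("exact    ", binomial_tail(n, p, a))
print("Markov   ", markov_bound(n, p, a))
print("Chebyshev", chebyshev_bound(n, p, a))
print("Chernoff ", chernoff_bound(n, p, a))
```

For these particular values the bounds tighten in the order Markov, Chebyshev, Chernoff; this ordering is a feature of the example and need not hold for every choice of a.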
27.4 Exercise


1. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with PDF $f_X$. Then the set of random variables $X_1, X_2, \ldots, X_n$ is called a random sample of size n of X. The sample mean is defined as

$$\overline{X}_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n).$$

Let $X_1, X_2, \ldots, X_n$ be a random sample of X with mean $\mu$ and variance $\sigma^2$. How many samples of X are required for the probability that the sample mean will not deviate from the true mean $\mu$ by more than $\sigma/10$ to be at least 0.95?


2. A biased coin, which lands heads with probability 5 each time it is flipped, is flipped 200 times 
consecutively. Give an upper bound on the probability that it lands heads at least 120 times. 


3. A post-office handles 10,000 letters per day with a variance of 2,000 letters. What can be said about 
the probability that this post office handles between 8,000 and 12,000 letters tomorrow? What about 
the probability that more than 15,000 letters come in? 
Lecture 28: Convergence of Random Variables and Related Theorems
Lecturer: Dr. Krishna Jagannathan Scribe: Gopal, Sudharsan, Ajay, Swamy, Kolla 


An important concept in Probability Theory is that of convergence of random variables. Since the important 
results in Probability Theory are the limit theorems that concern themselves with the asymptotic behaviour 
of random processes, studying the convergence of random variables becomes necessary. We begin by recalling 
some definitions pertaining to convergence of a sequence of real numbers. 


Definition 28.1 Let $\{x_n, n \ge 1\}$ be a real-valued sequence, i.e., a map from $\mathbb{N}$ to $\mathbb{R}$. We say that the sequence $\{x_n\}$ converges to some $x \in \mathbb{R}$ if for every $\epsilon > 0$ there exists an $n_0 \in \mathbb{N}$ such that

$$|x_n - x| < \epsilon, \quad \forall n \ge n_0.$$


We say that the sequence $\{x_n\}$ converges to $+\infty$ if for any $M > 0$, there exists an $n_0 \in \mathbb{N}$ such that for all $n \ge n_0$, $x_n > M$.

We say that the sequence $\{x_n\}$ converges to $-\infty$ if for any $M > 0$, there exists an $n_0 \in \mathbb{N}$ such that for all $n \ge n_0$, $x_n < -M$.


We now define various notions of convergence for a sequence of random variables. It would be helpful to recall 
that random variables are after all deterministic functions satisfying the measurability property. Hence, the 
simplest notion of convergence of a sequence of random variables is defined in a fashion similar to that for 
regular functions. 


Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $\{X_n\}_{n \in \mathbb{N}}$ be a sequence of real-valued random variables defined on this probability space.


Definition 28.2 [Definition 0 (Point-wise convergence or sure convergence)]
A sequence of random variables $\{X_n\}_{n \in \mathbb{N}}$ is said to converge point-wise or surely to X if

$$X_n(\omega) \to X(\omega), \quad \forall \omega \in \Omega.$$


Note that for a fixed $\omega$, $\{X_n(\omega)\}_{n \in \mathbb{N}}$ is a sequence of real numbers. Hence, the convergence for this sequence is the same as the one in Definition 28.1. Also, since X is the point-wise limit of random variables, it can be proved that X is a random variable, i.e., it is an $\mathcal{F}$-measurable function. This notion of convergence is exactly analogous to that defined for regular functions. Since this notion is too strict for most practical purposes, and takes into account neither the measurability of the random variables nor the probability measure $\mathbb{P}(\cdot)$, we define other notions that incorporate these characteristics.


Definition 28.3 [Definition 1 (Almost sure convergence or convergence with probability 1)]
A sequence of random variables $\{X_n\}_{n \in \mathbb{N}}$ is said to converge almost surely or with probability 1 (denoted by a.s. or w.p. 1) to X if

$$\mathbb{P}(\{\omega \mid X_n(\omega) \to X(\omega)\}) = 1.$$

Almost sure convergence demands that the set of $\omega$'s where the random variables converge have probability one. In other words, this definition gives the random variables "freedom" not to converge on a set of zero measure! Hence, this is a weakened notion as compared to that of sure convergence, but a more useful one.


In several situations, the notion of almost sure convergence turns out to be rather strict as well. So several 
other notions of convergence are defined. 


Definition 28.4 [Definition 2 (convergence in probability)]
A sequence of random variables $\{X_n\}_{n \in \mathbb{N}}$ is said to converge in probability (denoted by i.p.) to X if

$$\lim_{n \to \infty} \mathbb{P}(|X_n - X| > \epsilon) = 0, \quad \forall \epsilon > 0.$$


As seen from the above definition, this notion concerns itself with the convergence of a sequence of proba- 
bilities! 


At the first glance, it may seem that the notions of almost sure convergence and convergence in probability 
are the same. But the two definitions actually tell very different stories! For almost sure convergence, we 
collect all the w’s wherein the convergence happens, and demand that the measure of this set of w’s be 
1. But, in the case of convergence in probability, there is no direct notion of w since we are looking at a 
sequence of probabilities converging. To clarify this, we do away with the short-hand for probabilities (for 
the moment) and obtain the following expression for the definition of convergence in probability: 


$$\lim_{n \to \infty} \mathbb{P}(\{\omega \mid |X_n(\omega) - X(\omega)| > \epsilon\}) = 0, \quad \forall \epsilon > 0.$$


Since the notion of convergence of random variables is a very intricate one, it is worth spending some time 
pondering the same. 


Definition 28.5 [Definition 3 (convergence in $r^{th}$ mean)]

A sequence of random variables $\{X_n\}_{n \in \mathbb{N}}$ is said to converge in $r^{th}$ mean to X if

$$\lim_{n \to \infty} E[|X_n - X|^{r}] = 0.$$


In particular, when r = 2, the convergence is a widely used one. It goes by the special name of convergence 
in the mean-squared sense. 


The last notion of convergence, known as convergence in distribution, is the weakest notion of convergence. 
In essence, we look at the distributions (of random variables in the sequence in consideration) converging 
to some distribution (when the limit exists). This notion is extremely important in order to understand the 
Central Limit Theorem (to be studied in a later lecture). 


Definition 28.6 [Definition 4 (convergence in distribution or weak convergence)]
A sequence of random variables $\{X_n\}_{n \in \mathbb{N}}$ is said to converge in distribution to X if

$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x), \quad \forall x \in \mathbb{R} \text{ where } F_X(\cdot) \text{ is continuous.}$$

That is, the sequence of distributions must converge at all points of continuity of $F_X(\cdot)$. Unlike the previous four notions discussed above, for the case of convergence in distribution, the random variables need not be defined on a single probability space!


Before we look at an example that serves to clarify the above definitions, we summarize the notations for 
the above notions. 


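(1) Point-wise Convergence: $X_n \xrightarrow{p.w.} X$.

(2) Almost sure Convergence: $X_n \xrightarrow{a.s.} X$ or $X_n \xrightarrow{w.p.\ 1} X$.

(3) Convergence in probability: $X_n \xrightarrow{i.p.} X$.

(4) Convergence in $r^{th}$ mean: $X_n \xrightarrow{r} X$. When r = 2, $X_n \xrightarrow{m.s.} X$.

(5) Convergence in Distribution: $X_n \xrightarrow{D} X$.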


Example: Consider the probability space $(\Omega, \mathcal{F}, \mathbb{P}) = ([0, 1], \mathcal{B}([0, 1]), \lambda)$ and a sequence of random variables $\{X_n, n \ge 1\}$ defined by

$$X_n(\omega) = \begin{cases} n, & \text{if } \omega \in [0, \frac{1}{n}], \\ 0, & \text{otherwise.} \end{cases}$$

Since the probability measure specified is the Lebesgue measure, the random variable can be re-written as

$$X_n = \begin{cases} n, & \text{with probability } \frac{1}{n}, \\ 0, & \text{with probability } 1 - \frac{1}{n}. \end{cases}$$

Clearly, when $\omega \ne 0$, $\lim_{n \to \infty} X_n(\omega) = 0$, but the sequence diverges for $\omega = 0$. This suggests that the limiting random variable must be the constant random variable 0. Hence, except at $\omega = 0$, the sequence of random variables converges to the constant random variable 0. Therefore, this sequence does not converge surely, but converges almost surely.

For some $\epsilon > 0$, consider

$$\lim_{n \to \infty} \mathbb{P}(|X_n| > \epsilon) = \lim_{n \to \infty} \mathbb{P}(X_n = n) = \lim_{n \to \infty} \frac{1}{n} = 0.$$

Hence, the sequence converges in probability.
Consider the following two expressions:

$$\lim_{n \to \infty} E[|X_n|^2] = \lim_{n \to \infty} \left(n^2 \times \frac{1}{n} + 0\right) = \infty,$$

$$\lim_{n \to \infty} E[|X_n|] = \lim_{n \to \infty} \left(n \times \frac{1}{n} + 0\right) = 1.$$

Since the above limits do not equal 0, the sequence converges neither in the mean-squared sense, nor in the sense of first mean.


Considering the distribution of the $X_n$'s, it is clear (through visualization) that they converge to the following distribution:

$$F_X(x) = \begin{cases} 0, & \text{if } x < 0, \\ 1, & \text{otherwise.} \end{cases}$$

Also, this happens at each $x \ne 0$, i.e., at all points of continuity of $F_X(\cdot)$. Hence, the sequence of random variables converges in distribution.
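The quantities in this example can be tabulated exactly. The following short sketch (function names are illustrative and not part of the notes) uses rational arithmetic to evaluate $\mathbb{P}(|X_n| > \epsilon)$, $E[|X_n|]$, $E[|X_n|^2]$ and $F_{X_n}(x)$ for the two-point form of $X_n$ given above.

```python
from fractions import Fraction

def prob_exceeds(n, eps):
    """P(|X_n| > eps), where X_n = n w.p. 1/n and 0 otherwise."""
    return Fraction(1, n) if n > eps else Fraction(0)

def moment(n, r):
    """E[|X_n|^r] = n^r * (1/n)."""
    return Fraction(n) ** r * Fraction(1, n)

def cdf(n, x):
    """F_{X_n}(x) = P(X_n <= x)."""
    if x < 0:
        return Fraction(0)
    if x < n:
        return 1 - Fraction(1, n)
    return Fraction(1)

for n in (1, 10, 100, 1000):
    print(n, prob_exceeds(n, Fraction(1, 2)), moment(n, 1), moment(n, 2), cdf(n, 0.5))
```

As n grows, the probability column shrinks to 0 and the CDF value at x = 0.5 approaches 1, while the first moment stays at 1 and the second moment grows like n, matching the conclusions drawn in the example.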


So far, we have mentioned that certain notions are weaker than certain others. Let us now formalize the 
relations that exist among various notions of convergence. 


It is immediately clear from the definitions that point-wise convergence implies almost sure convergence. Figure 28.1 is a summary of the implications that hold for any sequence of random variables. No other implications hold in general. We prove these, in a series of theorems, as below.


[Figure 28.1: Implication Diagram. p.w. ⟹ a.s. ⟹ i.p. ⟹ D; $r^{th}$ mean ($r \ge 1$) ⟹ i.p.; $r^{th}$ mean ⟹ $s^{th}$ mean ($r > s \ge 1$).]


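Theorem 28.7 $X_n \xrightarrow{r} X \Rightarrow X_n \xrightarrow{i.p.} X$, $\forall r \ge 1$.


Proof: Consider the quantity $\lim_{n \to \infty} \mathbb{P}(|X_n - X| > \epsilon)$. Applying Markov's inequality, we get

$$\lim_{n \to \infty} \mathbb{P}(|X_n - X| > \epsilon) \le \lim_{n \to \infty} \frac{E[|X_n - X|^{r}]}{\epsilon^{r}} \overset{(a)}{=} 0, \quad \forall \epsilon > 0,$$

where (a) follows since $X_n \xrightarrow{r} X$. Hence proved. ∎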


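Theorem 28.8 $X_n \xrightarrow{i.p.} X \Rightarrow X_n \xrightarrow{D} X$.


Proof: Fix an $\epsilon > 0$.

$$F_{X_n}(x) = \mathbb{P}(X_n \le x)$$
$$= \mathbb{P}(X_n \le x, X \le x + \epsilon) + \mathbb{P}(X_n \le x, X > x + \epsilon)$$
$$\le F_X(x + \epsilon) + \mathbb{P}(|X_n - X| > \epsilon).$$

Similarly,

$$F_X(x - \epsilon) = \mathbb{P}(X \le x - \epsilon)$$
$$= \mathbb{P}(X \le x - \epsilon, X_n \le x) + \mathbb{P}(X \le x - \epsilon, X_n > x)$$
$$\le F_{X_n}(x) + \mathbb{P}(|X_n - X| > \epsilon).$$

Thus,

$$F_X(x - \epsilon) - \mathbb{P}(|X_n - X| > \epsilon) \le F_{X_n}(x) \le F_X(x + \epsilon) + \mathbb{P}(|X_n - X| > \epsilon).$$

As $n \to \infty$, since $X_n \xrightarrow{i.p.} X$, $\mathbb{P}(|X_n - X| > \epsilon) \to 0$. Therefore,

$$F_X(x - \epsilon) \le \liminf_{n \to \infty} F_{X_n}(x) \le \limsup_{n \to \infty} F_{X_n}(x) \le F_X(x + \epsilon), \quad \forall \epsilon > 0.$$

If $F_X$ is continuous at x, then $F_X(x - \epsilon) \uparrow F_X(x)$ and $F_X(x + \epsilon) \downarrow F_X(x)$ as $\epsilon \downarrow 0$. Hence proved. ∎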


Theorem 28.9 $X_n \xrightarrow{r} X \Rightarrow X_n \xrightarrow{s} X$, if $r > s \ge 1$.


Proof: From Lyapunov's Inequality [1, Chapter 4], we see that $(E[|X_n - X|^{s}])^{1/s} \le (E[|X_n - X|^{r}])^{1/r}$, $r > s \ge 1$. Hence, the result follows. ∎


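Theorem 28.10 $X_n \xrightarrow{i.p.} X \not\Rightarrow X_n \xrightarrow{r} X$ in general.


Proof: Proof by counter-example:

Let $X_n$ be an independent sequence of random variables defined as

$$X_n = \begin{cases} n^3, & \text{w.p. } \frac{1}{n^2}, \\ 0, & \text{w.p. } 1 - \frac{1}{n^2}. \end{cases}$$

Then, $\mathbb{P}(|X_n| > \epsilon) = \frac{1}{n^2}$ for large enough n, and hence $X_n \xrightarrow{i.p.} 0$. On the other hand, $E[|X_n|] = n$, which diverges to infinity as n grows unbounded. ∎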


Theorem 28.11 $X_n \xrightarrow{D} X \not\Rightarrow X_n \xrightarrow{i.p.} X$ in general.


Proof: Proof by counter-example:
Let X be a Bernoulli random variable with parameter 0.5, and define a sequence such that $X_i = X$ $\forall i$. Let $Y = 1 - X$. Clearly, $X_i \xrightarrow{D} Y$. But $|X_i - Y| = 1$, $\forall i$. Hence, $X_i$ does not converge to Y in probability. ∎


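Theorem 28.12 $X_n \xrightarrow{i.p.} X \not\Rightarrow X_n \xrightarrow{a.s.} X$ in general.


Proof: Proof by counter-example:
Let $\{X_n\}$ be a sequence of independent random variables defined as

$$X_n = \begin{cases} 1, & \text{w.p. } \frac{1}{n}, \\ 0, & \text{w.p. } 1 - \frac{1}{n}. \end{cases}$$

$\lim_{n \to \infty} \mathbb{P}(|X_n| > \epsilon) = \lim_{n \to \infty} \mathbb{P}(X_n = 1) = \lim_{n \to \infty} \frac{1}{n} = 0$. So, $X_n \xrightarrow{i.p.} 0$.
Let $A_n$ be the event $\{X_n = 1\}$. Then, the $A_n$'s are independent and $\sum_{n=1}^{\infty} \mathbb{P}(A_n) = \infty$. By Borel-Cantelli Lemma 2, w.p. 1 infinitely many $A_n$'s will occur, i.e., $\{X_n = 1\}$ i.o. So, $X_n$ does not converge to 0 almost surely. ∎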
Theorem 28.13 $X_n \xrightarrow{s} X \not\Rightarrow X_n \xrightarrow{r} X$ if $r > s \ge 1$ in general.


Proof: Proof by counter-example: 
Let {Xn} be a sequence of independent random variables defined as 


Hence, E[|X$|] =n? —> 0. But, E[|X7|] = n — ov. E 


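Theorem 28.14 $X_n \xrightarrow{m.s.} X \not\Rightarrow X_n \xrightarrow{a.s.} X$ in general.


Proof: Proof by counter-example:
Let $\{X_n\}$ be a sequence of independent random variables defined as

$$X_n = \begin{cases} 1, & \text{w.p. } \frac{1}{n}, \\ 0, & \text{w.p. } 1 - \frac{1}{n}. \end{cases}$$

$E[X_n^2] = \frac{1}{n}$. So, $X_n \xrightarrow{m.s.} 0$. As seen previously (during the proof of Theorem 28.12), $X_n$ does not converge to 0 almost surely. ∎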


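Theorem 28.15 $X_n \xrightarrow{a.s.} X \not\Rightarrow X_n \xrightarrow{m.s.} X$ in general.
Proof: Proof by counter-example:
Let $\{X_n\}$ be a sequence of random variables on the probability space $([0, 1], \mathcal{B}([0, 1]), \lambda)$ defined as

$$X_n(\omega) = \begin{cases} n, & \omega \in (0, \frac{1}{n}), \\ 0, & \text{otherwise.} \end{cases}$$

We know that $X_n$ converges to 0 almost surely. $E[X_n^2] = n \to \infty$. So, $X_n$ does not converge to 0 in the mean-squared sense. ∎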


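Before proving the implication $X_n \xrightarrow{a.s.} X \Rightarrow X_n \xrightarrow{i.p.} X$, we derive a sufficient condition followed by a necessary and sufficient condition for almost sure convergence.


Theorem 28.16 If $\forall \epsilon > 0$, $\sum_{n} \mathbb{P}(A_n(\epsilon)) < \infty$, then $X_n \xrightarrow{a.s.} X$, where $A_n(\epsilon) = \{\omega : |X_n(\omega) - X(\omega)| > \epsilon\}$.


Proof: By Borel-Cantelli Lemma 1, $A_n(\epsilon)$ occurs finitely often, for any $\epsilon > 0$, w.p. 1. Let $B_m(\epsilon) = \bigcup_{n \ge m} A_n(\epsilon)$. Therefore,

$$\mathbb{P}(B_m(\epsilon)) \le \sum_{n=m}^{\infty} \mathbb{P}(A_n(\epsilon)).$$

So, $\mathbb{P}(B_m(\epsilon)) \to 0$ as $m \to \infty$, whenever $\sum_{n} \mathbb{P}(A_n(\epsilon)) < \infty$. An equivalent way of proving almost sure convergence is to show that $\lim_{m \to \infty} \mathbb{P}\left(\bigcup_{n \ge m} \{\omega : |X_n(\omega) - X(\omega)| > \epsilon\}\right) = 0$, $\forall \epsilon > 0$, which is exactly the statement that $\mathbb{P}(B_m(\epsilon)) \to 0$ as $m \to \infty$. This implies almost sure convergence.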
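Theorem 28.17 $X_n \xrightarrow{a.s.} X$ iff $\mathbb{P}(B_m(\epsilon)) \to 0$ as $m \to \infty$, $\forall \epsilon > 0$.


Proof:

Let $A(\epsilon) = \{\omega \in \Omega : \omega \in A_n(\epsilon) \text{ for infinitely many values of } n\} = \bigcap_{m=1}^{\infty} \bigcup_{n=m}^{\infty} A_n(\epsilon)$.

If $X_n \xrightarrow{a.s.} X$, then it is easy to see that $\mathbb{P}(A(\epsilon)) = 0$, $\forall \epsilon > 0$.
Then, $\mathbb{P}\left(\bigcap_{m=1}^{\infty} B_m(\epsilon)\right) = 0$.
Since $\{B_m(\epsilon)\}$ is a nested, decreasing sequence, it follows from the continuity of probability measures that $\lim_{m \to \infty} \mathbb{P}(B_m(\epsilon)) = 0$.


Conversely, let $C = \{\omega \in \Omega : X_n(\omega) \to X(\omega) \text{ as } n \to \infty\}$. Then,

$$\mathbb{P}(C^{c}) = \mathbb{P}\left(\bigcup_{\epsilon > 0} A(\epsilon)\right)$$
$$= \mathbb{P}\left(\bigcup_{k=1}^{\infty} A\left(\tfrac{1}{k}\right)\right) \quad \text{as } A(\epsilon) \subseteq A(\epsilon') \text{ if } \epsilon \ge \epsilon'$$
$$\le \sum_{k=1}^{\infty} \mathbb{P}\left(A\left(\tfrac{1}{k}\right)\right).$$

Also, $\mathbb{P}(A(\epsilon)) = \lim_{m \to \infty} \mathbb{P}(B_m(\epsilon)) = 0$. Consider

$$\mathbb{P}\left(A\left(\tfrac{1}{k}\right)\right) = \lim_{m \to \infty} \mathbb{P}\left(B_m\left(\tfrac{1}{k}\right)\right) = 0, \quad \forall k \ge 1, \text{ by assumption.}$$

So, $\mathbb{P}(C^{c}) = 0$. Hence, $\mathbb{P}(C) = 1$. ∎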


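Corollary 28.18 $X_n \xrightarrow{a.s.} X \Rightarrow X_n \xrightarrow{i.p.} X$.


Proof: $X_n \xrightarrow{a.s.} X \Rightarrow \lim_{m \to \infty} \mathbb{P}(B_m(\epsilon)) = 0$.
As $A_m(\epsilon) \subseteq B_m(\epsilon)$, it is implied that $\lim_{m \to \infty} \mathbb{P}(A_m(\epsilon)) = 0$.
Hence, $X_n \xrightarrow{i.p.} X$. ∎

Theorem 28.19 If $X_n \xrightarrow{i.p.} X$, then there exists a deterministic, increasing subsequence $n_1, n_2, n_3, \ldots$ such that $X_{n_i} \xrightarrow{a.s.} X$ as $i \to \infty$.


Proof: The reader is referred to Theorem 13 in Chapter 7 of [1] for a proof. ∎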


Example: Let $\{X_n\}$ be a sequence of independent random variables defined as

$$X_n = \begin{cases} 1, & \text{w.p. } \frac{1}{n}, \\ 0, & \text{w.p. } 1 - \frac{1}{n}. \end{cases}$$

It is easy to verify that $X_n \xrightarrow{i.p.} 0$, but $X_n$ does not converge to 0 almost surely. However, if we consider the subsequence $\{X_1, X_4, X_9, \ldots\}$, this (sub)sequence of random variables converges almost surely to 0. This can be verified as follows.

Let $n_i = i^2$, $Y_i = X_{n_i} = X_{i^2}$.

Thus, $\mathbb{P}(Y_i = 1) = \mathbb{P}(X_{i^2} = 1) = \frac{1}{i^2}$.

$\Rightarrow \sum_{i \in \mathbb{N}} \mathbb{P}(Y_i = 1) = \sum_{i \in \mathbb{N}} \frac{1}{i^2} < \infty$. Hence, by BCL-1, $X_{i^2} \xrightarrow{a.s.} 0$.

Although this is not a proof for the above theorem, it serves to verify the statement via a concrete example.


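Theorem 28.20 [Skorokhod's Representation Theorem]
Let $\{X_n, n \ge 1\}$ and X be random variables on $(\Omega, \mathcal{F}, \mathbb{P})$ such that $X_n$ converges to X in distribution. Then, there exists a probability space $(\Omega', \mathcal{F}', \mathbb{P}')$, and random variables $\{Y_n, n \ge 1\}$ and Y on $(\Omega', \mathcal{F}', \mathbb{P}')$ such that,

a) $\{Y_n, n \ge 1\}$ and Y have the same distributions as $\{X_n, n \ge 1\}$ and X respectively.
b) $Y_n \xrightarrow{a.s.} Y$ as $n \to \infty$.


Theorem 28.21 [Continuous Mapping Theorem]
If $X_n \xrightarrow{D} X$, and $g : \mathbb{R} \to \mathbb{R}$ is continuous, then $g(X_n) \xrightarrow{D} g(X)$.


Proof: By Skorokhod's Representation Theorem, there exists a probability space $(\Omega', \mathcal{F}', \mathbb{P}')$, and $\{Y_n, n \ge 1\}$, Y on $(\Omega', \mathcal{F}', \mathbb{P}')$ such that $Y_n \xrightarrow{a.s.} Y$. Further, from continuity of g,

$$\{\omega \in \Omega' \mid g(Y_n(\omega)) \to g(Y(\omega))\} \supseteq \{\omega \in \Omega' \mid Y_n(\omega) \to Y(\omega)\},$$
$$\Rightarrow \mathbb{P}'(\{\omega \in \Omega' \mid g(Y_n(\omega)) \to g(Y(\omega))\}) \ge \mathbb{P}'(\{\omega \in \Omega' \mid Y_n(\omega) \to Y(\omega)\}),$$
$$\Rightarrow \mathbb{P}'(\{\omega \in \Omega' \mid g(Y_n(\omega)) \to g(Y(\omega))\}) \ge 1,$$
$$\Rightarrow g(Y_n) \xrightarrow{a.s.} g(Y),$$
$$\Rightarrow g(Y_n) \xrightarrow{D} g(Y).$$

This completes the proof since $g(Y_n)$ has the same distribution as $g(X_n)$, and $g(Y)$ has the same distribution as $g(X)$. ∎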


Theorem 28.22 $X_n \xrightarrow{D} X$ iff for every bounded continuous function $g : \mathbb{R} \to \mathbb{R}$, we have $\mathbb{E}[g(X_n)] \to \mathbb{E}[g(X)]$.


Proof: Here, we present only a partial proof. For a full treatment, the reader is referred to Theorem 9 in Chapter 7 of [1].

Assume $X_n \xrightarrow{D} X$. From Skorokhod's Representation Theorem, we know that there exist random variables $\{Y_n, n \ge 1\}$ and $Y$, such that $Y_n \xrightarrow{a.s.} Y$. From the Continuous Mapping Theorem, it follows that $g(Y_n) \xrightarrow{a.s.} g(Y)$, since $g$ is given to be continuous. Now, since $g$ is bounded, by DCT, we have $\mathbb{E}[g(Y_n)] \to \mathbb{E}[g(Y)]$. Since $g(Y_n)$ has the same distribution as $g(X_n)$, and $g(Y)$ has the same distribution as $g(X)$, we have $\mathbb{E}[g(X_n)] \to \mathbb{E}[g(X)]$. ∎


Theorem 28.23 If $X_n \xrightarrow{D} X$, then $C_{X_n}(t) \to C_X(t)$, $\forall t$.


Proof: If $X_n \xrightarrow{D} X$, then from Skorokhod's Representation Theorem, there exist random variables $\{Y_n\}$ and $Y$ such that $Y_n \xrightarrow{a.s.} Y$.
So, almost surely,

$$\cos(Y_n t) \to \cos(Y t), \quad \sin(Y_n t) \to \sin(Y t), \quad \forall t.$$
As $\cos(\cdot)$ and $\sin(\cdot)$ are bounded functions, by DCT,

$$\mathbb{E}[\cos(Y_n t)] + i\,\mathbb{E}[\sin(Y_n t)] \to \mathbb{E}[\cos(Y t)] + i\,\mathbb{E}[\sin(Y t)], \quad \forall t,$$

$$\implies C_{Y_n}(t) \to C_Y(t), \quad \forall t.$$

We get,

$$C_{X_n}(t) \to C_X(t), \quad \forall t,$$
since the distributions of $\{X_n\}$ and $X$ are the same as those of $\{Y_n\}$ and $Y$ respectively, from Skorokhod's Representation Theorem. ∎
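As a numerical illustration of Theorem 28.23, the sketch below uses an example that is not from the lecture: $X_n$ uniform on $\{1/n, 2/n, \ldots, 1\}$, which converges in distribution to a Uniform(0,1) random variable since its CDF $\lfloor nx \rfloor / n$ tends to $x$ on $[0,1]$. The function names are hypothetical. The code evaluates both characteristic functions at a fixed $t$ and shows the gap shrinking with $n$.

```python
import cmath

def cf_discrete_uniform(t, n):
    """Characteristic function of X_n uniform on {1/n, 2/n, ..., 1}."""
    return sum(cmath.exp(1j * t * k / n) for k in range(1, n + 1)) / n

def cf_uniform01(t):
    """Characteristic function of Uniform(0, 1): (e^{it} - 1) / (it)."""
    if t == 0:
        return 1.0 + 0j
    return (cmath.exp(1j * t) - 1) / (1j * t)

if __name__ == "__main__":
    t = 2.0
    for n in (20, 200, 2000):
        gap = abs(cf_discrete_uniform(t, n) - cf_uniform01(t))
        print(n, gap)
```

The gap at a single $t$ is of course only a spot check; the theorem asserts pointwise convergence for every $t$.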


Theorem 28.24 Let $\{X_n\}$ be a sequence of RVs with characteristic functions $C_{X_n}(t)$ for each $n$, and let $X$ be a RV with characteristic function $C_X(t)$. If $C_{X_n}(t) \to C_X(t)$, $\forall t$, then $X_n \xrightarrow{D} X$.


Theorem 28.25 Let $\{X_n\}$ be a sequence of RVs with characteristic functions $C_{X_n}(t)$ for each $n$, and suppose $\lim_{n \to \infty} C_{X_n}(t)$ exists $\forall t$, and is denoted by $\phi(t)$. Then, one of the following statements is true:
(a) $\phi(\cdot)$ is discontinuous at $t = 0$, and in this case, $X_n$ does not converge in distribution.


(b) $\phi(\cdot)$ is continuous at $t = 0$, and in this case, $\phi$ is a valid characteristic function of some RV $X$. Then $X_n \xrightarrow{D} X$.


Remark 28.26 In order to prove that the $\phi(t)$ above is indeed a valid characteristic function, we need to verify the three defining properties of characteristic functions. However, in the light of Theorem (28.25), it is sufficient to verify the continuity of $\phi(t)$ at $t = 0$. After all, $\phi(t)$ is not an arbitrary function; it is the limit of the characteristic functions of the $X_n$'s, and therefore inherits some nice properties. Due to these inherited properties, it turns out it is enough to verify continuity at $t = 0$, instead of verifying all the conditions of Bochner's theorem!


Note: Theorems (28.24) and (28.25) together are known as the Continuity Theorem. For a proof, refer to [1].


28.1 Exercises 


1. (a) Prove that convergence in probability implies convergence in distribution, and give a counter- 
example to show that the converse need not hold. 


(b) Show that convergence in distribution to a constant random variable implies convergence in prob- 
ability to that constant. 


2. Consider the sequence of random variables with densities

$$f_{X_n}(x) = 1 - \cos(2\pi n x), \quad x \in (0, 1).$$
Do the $X_n$'s converge in distribution? Does the sequence of densities converge?


3. [Grimmett] A sequence $\{X_n, n \ge 1\}$ of random variables is said to be completely convergent to $X$ if

$$\sum_{n} \mathbb{P}(|X_n - X| > \epsilon) < \infty, \quad \forall \epsilon > 0.$$
Show that, for sequences of independent random variables, complete convergence is equivalent to almost sure convergence. Find a sequence of (dependent) random variables that converge almost surely but not completely.


4. Construct an example of a sequence of characteristic functions $\phi_n(t)$ such that the limit $\phi(t) = \lim_{n \to \infty} \phi_n(t)$ exists for all $t$, but $\phi(t)$ is not a valid characteristic function.
References 
Lecture 29: The Laws of Large Numbers
Lecturer: Dr. Krishna Jagannathan Scribe: Ravi Kolla, Vishakh Hegde and Arjun Bhagoji 


In this lecture, we study the laws of large numbers (LLNs), which are arguably the single most important class of theorems in probability theory and form its backbone. In particular, the LLNs provide an intuitive interpretation for the expectation of a random variable as the 'average value' of the random variable. In the case of i.i.d. random variables that we consider in this lecture, the LLN roughly says that the sample average of a large number of i.i.d. random variables converges to the expected value. The sense of convergence in the weak law of large numbers is convergence in probability. The strong law of large numbers, as the name suggests, asserts the stronger notion of almost sure convergence.


29.1 Weak Law of Large Numbers 


The earliest available proof of the weak law of large numbers dates to the year 1713, in the posthumously published work of Jacob Bernoulli. It asserts convergence in probability of the sample average to the expected value.


Theorem 29.1 (Weak Law of Large Numbers) Let $X_1, X_2, \ldots$ be i.i.d. random variables with finite mean $\mathbb{E}[X]$. Let $S_n = \sum_{i=1}^{n} X_i$. Then,

$$\frac{S_n}{n} \xrightarrow{i.p.} \mathbb{E}[X].$$


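Proof: First, we give a partial proof by assuming the variance of $X$ to be finite, i.e., $\sigma_X^2 < \infty$. Since the $X_i$'s are i.i.d., $\mathbb{E}[S_n] = n\mathbb{E}[X]$ and $\mathrm{Var}(S_n) = n\,\mathrm{Var}(X)$, so that $\mathbb{E}\left[\frac{S_n}{n}\right] = \mathbb{E}[X]$ and $\mathrm{Var}\left(\frac{S_n}{n}\right) = \frac{\sigma_X^2}{n}$. Hence, for any $\epsilon > 0$,

$$\lim_{n \to \infty} \mathbb{P}\left(\left|\frac{S_n}{n} - \mathbb{E}[X]\right| > \epsilon\right) \le \lim_{n \to \infty} \frac{\mathrm{Var}\left(\frac{S_n}{n}\right)}{\epsilon^2} \quad \text{(by Chebyshev's Inequality)},$$
$$= \lim_{n \to \infty} \frac{\sigma_X^2}{n \epsilon^2},$$
$$= 0.$$
∎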
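The Chebyshev step in the partial proof can be sanity-checked on a case where the left-hand side is computable exactly. The sketch below assumes, purely for illustration, that the $X_i$ are Bernoulli($p$), so that $S_n$ is Binomial($n$, $p$) and $\sigma_X^2 = p(1-p)$; the function names are hypothetical. It compares the exact tail probability with the bound $\sigma_X^2 / (n\epsilon^2)$.

```python
from math import comb

def exact_tail(n, eps, p=0.5):
    """P(|S_n/n - p| > eps) for S_n ~ Binomial(n, p), by direct summation."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1) if abs(k / n - p) > eps)

def chebyshev_bound(n, eps, p=0.5):
    """sigma_X^2 / (n * eps^2) with sigma_X^2 = p * (1 - p)."""
    return p * (1 - p) / (n * eps ** 2)

if __name__ == "__main__":
    eps = 0.1
    for n in (10, 100, 1000):
        print(n, exact_tail(n, eps), chebyshev_bound(n, eps))
```

In this example the exact tail probability decays much faster than the bound; Chebyshev's inequality is loose, but it is enough to conclude convergence in probability.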
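Next, we give a general proof using characteristic functions.
Proof: Assume that $X_i$ (where $i = 1, 2, \ldots, n, \ldots$) are i.i.d. random variables. Let the characteristic function of $X_i$ be $C_{X_i}(t) \equiv C_X(t)$ for any $i \in \{1, 2, \ldots, n\}$. Let $S_n = X_1 + X_2 + \cdots + X_n$ be the sum of these $n$ i.i.d. random variables. The following can be easily verified:

$$C_{S_n}(t) = [C_X(t)]^n,$$
$$C_{S_n/n}(t) = \mathbb{E}\left[e^{\,i t S_n / n}\right] = C_{S_n}\left(\frac{t}{n}\right).$$

This implies that,

$$C_{S_n/n}(t) = \left[C_X\left(\frac{t}{n}\right)\right]^n = \left[1 + \frac{i\,\mathbb{E}[X]\,t}{n} + o\left(\frac{t}{n}\right)\right]^n.$$

As $n \to \infty$, we have,

$$C_{S_n/n}(t) \to e^{\,i\,\mathbb{E}[X]\,t}, \quad \forall t \in \mathbb{R}.$$

Note that $e^{\,i\,\mathbb{E}[X]\,t}$ is a valid characteristic function. In fact, it is the characteristic function of a constant random variable which takes the value $\mathbb{E}[X]$. From the theorem on convergence of characteristic functions, we have

$$\frac{S_n}{n} \xrightarrow{D} \mathbb{E}[X].$$

Since $\mathbb{E}[X]$ is a constant, we have¹,

$$\frac{S_n}{n} \xrightarrow{i.p.} \mathbb{E}[X].$$
∎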

29.2 Strong Law of Large Numbers 


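The Strong Law of Large Numbers (SLLN) gives us the condition under which the sample average $\frac{S_n}{n}$ converges almost surely to the expected value.


Theorem 29.2 If $\{X_i, i \ge 1\}$ is a sequence of i.i.d. RVs with $\mathbb{E}[|X_i|] < \infty$, then $\frac{S_n}{n} \xrightarrow{a.s.} \mathbb{E}[X]$, i.e.,
$$\mathbb{P}\left(\left\{\omega \,\Big|\, \frac{S_n(\omega)}{n} \to \mathbb{E}[X]\right\}\right) = 1.$$


Here, $S_n(\omega)$ is just $X_1(\omega) + X_2(\omega) + \cdots + X_n(\omega)$. Thus, for a fixed $\omega \in \Omega$, $\left\{\frac{S_n(\omega)}{n}, n \ge 1\right\}$ is a sequence of real numbers. Then, there are the following three possibilities regarding the convergence of this sequence:


1. The sequence $\frac{S_n(\omega)}{n}$ does not converge as $n \to \infty$.


2. The sequence $\frac{S_n(\omega)}{n}$ converges to a value other than $\mathbb{E}[X]$, as $n \to \infty$.


3. The sequence $\frac{S_n(\omega)}{n}$ converges to $\mathbb{E}[X]$ as $n \to \infty$.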

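The SLLN asserts that the set of $\omega \in \Omega$ where the third possibility holds has a probability of 1. Also, the SLLN implies the WLLN because almost sure convergence implies convergence in probability. From Theorem 28.16, we obtain another way of stating the SLLN as given below:


$$\lim_{n \to \infty} \mathbb{P}\left(\bigcup_{m \ge n} \left\{\omega : \left|\frac{S_m(\omega)}{m} - \mathbb{E}[X]\right| > \epsilon\right\}\right) = 0, \quad \forall \epsilon > 0. \qquad (29.1)$$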
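The statement of the SLLN concerns individual sample paths, which can be visualised by simulating one $\omega$. The sketch below is illustrative only: it assumes Uniform(0,1) summands (so $\mathbb{E}[X] = 0.5$) and a fixed seed, and the function name is hypothetical. It records the running averages $S_m/m$ along a single simulated path and reports the largest deviation from $\mathbb{E}[X]$ beyond a given index, mirroring the event inside (29.1).

```python
import random

def running_averages(n, seed=0):
    """Return [S_1/1, S_2/2, ..., S_n/n] for one simulated path of Uniform(0,1) draws."""
    rng = random.Random(seed)
    total = 0.0
    averages = []
    for m in range(1, n + 1):
        total += rng.random()  # X_m ~ Uniform(0, 1), E[X] = 0.5
        averages.append(total / m)
    return averages

if __name__ == "__main__":
    avgs = running_averages(20000)
    for start in (10, 100, 1000, 10000):
        # Largest deviation |S_m/m - E[X]| over all m >= start on this path.
        worst = max(abs(a - 0.5) for a in avgs[start - 1:])
        print(start, worst)
```

A single simulated path cannot establish almost sure convergence; it only shows what the third possibility looks like for one realisation.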


A general proof of the SLLN is rather long, so we will restrict ourselves to two partial proofs, each of which 
makes a stronger assumption than needed about the moments of the random variable X. 


¹ Recall that convergence in probability is equivalent to convergence in distribution, when the limit is a constant.
29.3 Partial Proof 1 (assuming finite fourth moment)


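Proof: Assume $\mathbb{E}[X^4] = \eta < \infty$ and, without loss of generality, $\mathbb{E}[X] = 0$. The second assumption is not crucial. We want to show that $\frac{S_n}{n} \xrightarrow{a.s.} 0$.


Now,

$$\mathbb{E}[S_n^4] = \mathbb{E}[(X_1 + X_2 + \cdots + X_n)^4],$$
$$= n\eta + \binom{4}{2}\binom{n}{2}\sigma^4 \qquad (29.2)$$
$$\le n\eta + 3n^2\sigma^4. \qquad (29.3)$$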


In (29.2), the coefficient of $\eta$ is $n$ because there are $n$ terms of the form $X_i^4$. Terms of the form $X_i^3 X_j$ (and, likewise, any term in which some $X_j$ appears only to the first power) are not present, since independence and our assumption that $\mathbb{E}[X] = 0$ ensure that their expectations are zero. For the other surviving terms of the form $X_i^2 X_j^2$, the coefficient arises because there are $\binom{n}{2}$ ways to choose the distinct indices $i$ and $j$, after which one can choose $X_i$ from 2 out of the 4 terms being multiplied together, in which case $X_j$ will come from the other two terms.
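The counting argument behind (29.2) can be verified by brute force on a small case. The sketch below assumes, for illustration, that each $X_i$ is uniform on $\{-1, +1\}$, so that $\mathbb{E}[X] = 0$, $\sigma^2 = 1$ and $\eta = \mathbb{E}[X^4] = 1$; the function names are hypothetical. It enumerates all $2^n$ outcomes to compute $\mathbb{E}[S_n^4]$ exactly and compares the result with $n\eta + \binom{4}{2}\binom{n}{2}\sigma^4$ and with the upper bound in (29.3).

```python
from itertools import product

def fourth_moment_exact(n):
    """E[S_n^4] for i.i.d. X_i uniform on {-1, +1}, by enumerating all 2^n outcomes."""
    return sum(sum(xs) ** 4 for xs in product((-1, 1), repeat=n)) / 2 ** n

def fourth_moment_formula(n, eta=1.0, sigma2=1.0):
    """Right-hand side of (29.2): n*eta + C(4,2) * C(n,2) * sigma^4."""
    return n * eta + 6 * (n * (n - 1) / 2) * sigma2 ** 2

if __name__ == "__main__":
    for n in range(1, 11):
        bound = n + 3 * n ** 2  # right-hand side of (29.3) with eta = sigma = 1
        print(n, fourth_moment_exact(n), fourth_moment_formula(n), bound)
```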


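Now, we make use of the Markov inequality and substitute the bound on $\mathbb{E}[S_n^4]$ from (29.3).


$$\mathbb{P}\left(\left|\frac{S_n}{n}\right| > \epsilon\right) = \mathbb{P}\left(S_n^4 > n^4\epsilon^4\right) \le \frac{\mathbb{E}[S_n^4]}{n^4\epsilon^4}$$
$$\le \frac{n\eta + 3n^2\sigma^4}{n^4\epsilon^4}$$
$$= \frac{\eta}{n^3\epsilon^4} + \frac{3\sigma^4}{n^2\epsilon^4}. \qquad (29.4)$$
Then, from (29.4),
$$\sum_{n=1}^{\infty} \mathbb{P}\left(\left|\frac{S_n}{n}\right| > \epsilon\right) \le \sum_{n=1}^{\infty} \frac{\eta}{n^3\epsilon^4} + \sum_{n=1}^{\infty} \frac{3\sigma^4}{n^2\epsilon^4} < \infty. \qquad (29.5)$$
Since the series in (29.5) converges for every $\epsilon > 0$, the first Borel-Cantelli lemma implies that, almost surely, $\left|\frac{S_n}{n}\right| > \epsilon$ occurs for only finitely many $n$. Hence $\frac{S_n}{n} \xrightarrow{a.s.} 0$. ∎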


29.4 Partial Proof 2 (assuming finite variance) 


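Assume $\sigma_X^2 < \infty$ and $\mathbb{E}[X] = \mu$. We begin by proving the SLLN for $X_i \ge 0$. From the partial proof of the Weak Law of Large Numbers, we have


$$\mathbb{P}\left(\left|\frac{S_n}{n} - \mu\right| > \epsilon\right) \le \frac{\sigma_X^2}{n\epsilon^2}. \qquad (29.6)$$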
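To obtain a.s. convergence, consider a deterministic subsequence $n_i = i^2$, $i \ge 1$. Thus we get,

$$\mathbb{P}\left(\left|\frac{S_{i^2}}{i^2} - \mu\right| > \epsilon\right) \le \frac{\sigma_X^2}{i^2\epsilon^2},$$

which implies that

$$\sum_{i=1}^{\infty} \mathbb{P}\left(\left|\frac{S_{i^2}}{i^2} - \mu\right| > \epsilon\right) < \infty, \quad \forall \epsilon > 0.$$

Using Borel-Cantelli lemma 1 we conclude that

$$\frac{S_{i^2}}{i^2} \xrightarrow{a.s.} \mu \quad \text{as } i \to \infty.$$

Let $n$ be such that $i^2 \le n \le (i+1)^2$. Since $X_i \ge 0$,

$$S_{i^2} \le S_n \le S_{(i+1)^2},$$
$$\implies \frac{S_{i^2}}{(i+1)^2} \le \frac{S_n}{n} \le \frac{S_{(i+1)^2}}{i^2}.$$

Multiplying the expression on the left by $i^2$ in both the numerator and denominator, and similarly for the expression on the right, except by $(i+1)^2$, we get

$$\frac{S_{i^2}}{i^2} \cdot \frac{i^2}{(i+1)^2} \le \frac{S_n}{n} \le \frac{S_{(i+1)^2}}{(i+1)^2} \cdot \frac{(i+1)^2}{i^2}.$$

As $i \to \infty$, we have $\frac{i^2}{(i+1)^2} \to 1$, so that

$$\frac{S_{i^2}}{i^2} \cdot \frac{i^2}{(i+1)^2} \xrightarrow{a.s.} \mu \quad \text{and} \quad \frac{S_{(i+1)^2}}{(i+1)^2} \cdot \frac{(i+1)^2}{i^2} \xrightarrow{a.s.} \mu.$$

Thus, by the sandwich theorem, we get

$$\frac{S_n}{n} \xrightarrow{a.s.} \mu.$$

To generalise to arbitrary RVs with a finite variance, we just write $X_n = X_n^+ - X_n^-$ and proceed as above, since both $X_n^+$ and $X_n^-$ have a finite variance and are non-negative.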


29.5 Exercises 


1. [Gallager] A town starts a mosquito control program and the random variable $Z_n$ is the number of mosquitoes at the end of the $n$th year ($n = 0, 1, \ldots$). Let $X_n$ be the growth rate of mosquitoes in the year $n$, i.e., $Z_n = X_n Z_{n-1}$, $n \ge 1$. Assume that $\{X_n, n \ge 1\}$ is a sequence of i.i.d. random variables with the PMF $\mathbb{P}(X = 2) = \frac{1}{2}$, $\mathbb{P}(X = \frac{1}{2}) = \frac{1}{4}$ and $\mathbb{P}(X = \frac{1}{4}) = \frac{1}{4}$. Suppose $Z_0$, the initial number of mosquitoes, is a known constant and assume, for simplicity and consistency, that $Z_n$ can take non-integer values.


(a) Find $\mathbb{E}[Z_n]$ and $\lim_{n \to \infty} \mathbb{E}[Z_n]$.

(b) Based on your answer to part (a), can you conclude whether or not the mosquito control program is successful? What would your conclusion be?

(c) Let $W_n = \log_2 X_n$. Find $\mathbb{E}[W_n]$ and $\mathbb{E}\left[\log_2 \frac{Z_n}{Z_0}\right]$.

(d) Show that there exists a constant $\alpha$ such that $\lim_{n \to \infty} \frac{1}{n} \log_2 Z_n = \alpha$ almost surely.

(e) Show that there is a constant $\beta$ such that $\lim_{n \to \infty} Z_n = \beta$ almost surely.

(f) Based on your answer to part (e), can you conclude whether or not the mosquito control program is successful? What would your conclusion be?


(g) How do you reconcile your answers to parts (b) and (f)?


2. Imagine a world in which the value of $\pi$ is unknown. It is known that the area of a circle is proportional to the square of the radius, but the constant of proportionality is unknown. Suppose you are given a uniform random variable generator, and you can generate as many i.i.d. samples as you need. Devise a method to estimate the value of the proportionality constant without actually measuring the area/circumference of the circle.
Lecture 30: The Central Limit Theorem
Lecturer: Dr. Krishna Jagannathan Scribes: Vishakh Hegde 


30.1 Central Limit Theorem 


In this section, we will state and prove the central limit theorem. Let $\{X_i\}$ be a sequence of i.i.d. random variables having a finite variance. From the law of large numbers we know that for large $n$, the sum $S_n$ is approximately as big as $n\mathbb{E}[X]$, i.e.,


$$\frac{S_n}{n} \xrightarrow{i.p.} \mathbb{E}[X],$$
$$\implies \frac{S_n - n\mathbb{E}[X]}{n} \xrightarrow{i.p.} 0.$$


Thus whenever the variance of $X_i$ is finite, the difference $S_n - n\mathbb{E}[X]$ grows slower as compared to $n$. The Central Limit Theorem (CLT) says that this difference scales as $\sqrt{n}$, and that the distribution of $\frac{S_n - n\mathbb{E}[X]}{\sqrt{n}}$ approaches a normal distribution as $n \to \infty$ irrespective of the distribution of $X_i$. That is, for large $n$,


$$\frac{S_n - n\mathbb{E}[X]}{\sqrt{n}} \sim \mathcal{N}(0, \sigma_X^2).$$


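Theorem 30.1 (Central Limit Theorem) Let $\{X_i\}$ be a sequence of i.i.d. random variables with mean $\mathbb{E}[X]$ and a non-zero variance $\sigma_X^2 < \infty$. Let $Z_n = \frac{S_n - n\mathbb{E}[X]}{\sigma_X \sqrt{n}}$. Then, we have $Z_n \xrightarrow{D} \mathcal{N}(0, 1)$, i.e.,


$$\lim_{n \to \infty} F_{Z_n}(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{x^2}{2}} \, dx, \quad \forall z \in \mathbb{R}.$$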


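Proof: Let $Y_n = \frac{X_n - \mathbb{E}[X]}{\sigma_X}$. Then $Z_n = \frac{\sum_{i=1}^{n} Y_i}{\sqrt{n}}$. It is easy to see that $Y_n$ has unit variance and zero mean, i.e., $\mathbb{E}[Y_n] = 0$ and $\sigma_{Y_n}^2 = 1$. Expanding the characteristic function of $Y_n$ around $t = 0$,


$$C_{Y_n}(t) = 1 + it\,\mathbb{E}[Y_n] + \frac{i^2 t^2}{2}\mathbb{E}[Y_n^2] + o(t^2),$$
$$= 1 - \frac{t^2}{2} + o(t^2).$$

Since the $Y_i$'s are i.i.d. with common characteristic function $C_Y$, we have

$$C_{Z_n}(t) = \left[C_Y\left(\frac{t}{\sqrt{n}}\right)\right]^n = \left[1 - \frac{t^2}{2n} + o\left(\frac{t^2}{n}\right)\right]^n \to e^{-\frac{t^2}{2}}, \quad \forall t, \text{ as } n \to \infty,$$

which is the characteristic function of a standard Gaussian random variable. From the theorem on convergence of characteristic functions, $Z_n$ converges to a standard Gaussian in distribution. ∎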
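The convergence of characteristic functions used in the proof can be checked numerically in a simple case. The sketch below assumes, for illustration, that $Y$ takes the values $\pm 1$ with probability $1/2$ each (zero mean, unit variance), so that $C_Y(t) = \cos(t)$; the function names are hypothetical. It compares $[C_Y(t/\sqrt{n})]^n$ with $e^{-t^2/2}$.

```python
import math

def cf_standardized_sum(t, n):
    """C_{Z_n}(t) = [C_Y(t / sqrt(n))]^n with C_Y(t) = cos(t), Y = +/-1 w.p. 1/2 each."""
    return math.cos(t / math.sqrt(n)) ** n

def cf_standard_normal(t):
    """Characteristic function of N(0, 1)."""
    return math.exp(-t * t / 2)

if __name__ == "__main__":
    t = 1.0
    for n in (10, 1000, 10 ** 6):
        print(n, abs(cf_standardized_sum(t, n) - cf_standard_normal(t)))
```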


For example, if the $X_i$'s are discrete random variables, the CDFs of $Z_n$ will be step functions. As $n \to \infty$, these step functions will gradually converge to the standard Gaussian CDF, which is expressed in terms of the error function (i.e., the steps will gradually shrink to form a continuous distribution as $n \to \infty$).
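The shrinking of the steps can be quantified in a concrete case. The sketch below assumes, for illustration, that the $X_i$ are Bernoulli(1/2), so that $S_n$ is Binomial($n$, 1/2) and $Z_n = (S_n - n/2)/(\sqrt{n}/2)$; the function names are hypothetical. It computes the largest gap between the step-function CDF of $Z_n$ and the standard Gaussian CDF, checking both sides of every jump.

```python
from math import comb, erf, sqrt

def normal_cdf(z):
    """Standard Gaussian CDF, written in terms of the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def sup_cdf_gap(n):
    """sup_z |F_{Z_n}(z) - Phi(z)| for X_i ~ Bernoulli(1/2)."""
    mean, std = n / 2, sqrt(n) / 2
    gap, cdf = 0.0, 0.0
    for k in range(n + 1):
        phi = normal_cdf((k - mean) / std)
        gap = max(gap, abs(cdf - phi))   # left limit, just before the jump at k
        cdf += comb(n, k) / 2 ** n       # add P(S_n = k)
        gap = max(gap, abs(cdf - phi))   # value at the jump
    return gap

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, sup_cdf_gap(n))
```

Because the Gaussian CDF is continuous and increasing while the CDF of $Z_n$ is constant between jumps, it suffices to inspect the jump points, which is the design choice made in `sup_cdf_gap`.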


It is also important to understand what this theorem does not say. It is not saying that the probability density function converges to $\frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$. Convergence of the density function requires more stringent conditions, which are stated in the Local Central Limit Theorem.


Theorem 30.2 (Local Central Limit Theorem) Let $X_1, X_2, \ldots$ be i.i.d. random variables with zero mean and unit variance. Suppose further that their common characteristic function $\phi$ satisfies the following:


$$\int_{-\infty}^{\infty} |\phi(t)|^r \, dt < \infty,$$


for some integer $r \ge 1$. The density function $g_n$ of $U_n = \frac{X_1 + X_2 + \cdots + X_n}{\sqrt{n}}$ exists for $n \ge r$, and furthermore we have,
$$g_n(x) \to \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$$
as $n \to \infty$, uniformly in $x \in \mathbb{R}$.
Proof: For a proof, refer to Section 5.10 in [1]. ∎

Let $X_1, X_2, \ldots$ be i.i.d. random variables with zero mean and unit variance. From the CLT, we know that for large $n$, $U_n = \frac{\sum_{i=1}^{n} X_i}{\sqrt{n}}$ is approximately distributed as a standard Gaussian. We now look at yet another interesting result which deals with the largest value taken by $U_m$, $m \ge n$, for a large $n$.


Theorem 30.3 (The Law of the Iterated Logarithm) Let $X_1, X_2, \ldots$ be i.i.d. random variables with zero mean and unit variance. Also, let $S_n = \sum_{i=1}^{n} X_i$. Then,


$$\mathbb{P}\left(\limsup_{n \to \infty} \frac{S_n}{\sqrt{2n \log\log n}} = 1\right) = 1.$$


Unlike the CLT, which talks about the distribution of $U_n$ for a large, fixed $n$, the law of the iterated logarithm talks about the largest fluctuation in $U_m$, for $m \ge n$. In particular, it bounds the largest value taken by $U_m$ beyond $n$. Formally, the subset of $\Omega$ for which this holds has probability measure 1.